aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAge
* Hook into kernel/exit.c to force exiting RT tasks into best-effort modeBjoern Brandenburg2014-10-21
|
* Protect LITMUS^RT tasks from re-nicingBjoern Brandenburg2014-10-21
| | | | | Assigning a nice value to LITMUS^RT tasks is meaningless. Bail out early.
* Don't call set_tsk_need_resched() on remote LITMUS^RT taskBjoern Brandenburg2014-10-21
| | | | | This patch fixes a BUG_ON() in litmus/preempt.c:33 reported by Felipe Cerqueira & Manohar Vanga.
* Hook into rt scheduling class to protect LITMUS^RT tasksBjoern Brandenburg2014-10-21
| | | | | | The rt scheduling class thinks it's the highest-priority scheduling class around. It is not in LITMUS^RT. Don't go preempting remote cores that run SCHED_LITMUS tasks.
* Don't trigger load balancer in scheduler tick for LITMUS^RTBjoern Brandenburg2014-10-21
|
* Cache 'prev' in LITMUS^RT runqueueBjoern Brandenburg2014-10-21
| | | | | | LITMUS^RT plugins like to know who 'prev' is. pick_next_task() doesn't expose that info, so we just cache prev in the runqueue. Could robably be replaced by looking at 'current' instead.
* Hook into finish_switch()Bjoern Brandenburg2014-10-21
| | | | To keep track of stack usage and to notify plugin, if necessary.
* Reset SCHED_LITMUS scheduling class on forkBjoern Brandenburg2014-10-21
|
* Block sched_setaffinity() for SCHED_LITMUS tasksBjoern Brandenburg2014-10-21
|
* Integrate LITMUS^RT scheduling class with sched_setschedulerBjoern Brandenburg2014-10-21
|
* Disable cut-to-CFS optimization in Linux schedulerBjoern Brandenburg2014-10-21
| | | | | Global plugins require that the plugin be called even if there currently is no real-time task executing on the local core.
* Integrate LITMUS^RT with try_to_wake_up() pathBjoern Brandenburg2014-10-21
|
* Make LITMUS^RT scheduling class the highest-priority scheduling classBjoern Brandenburg2014-10-21
| | | | | | Needs to be above stop_machine_class for legacy reasons; the main plugins were developed before stop_machine_class was introduced and assume that they are the highest-priority scheduling class.
* Add LITMUS^RT scheduling class in kernel/sched/MakefileBjoern Brandenburg2014-10-21
|
* Introduce LITMUS^RT runqueue dummy into struct rqBjoern Brandenburg2014-10-21
|
* Integrate SRP ceiling blocking callback with Linux schedulerBjoern Brandenburg2014-10-21
| | | | Check whether a suspension is required at end of schedule().
* Hookup sched_trace_XXX() tracing in Linux schedulerBjoern Brandenburg2014-10-21
| | | | This patch adds context switch tracing to the main Linux scheduler.
* Augment rt_task() with is_realtime()Bjoern Brandenburg2014-10-21
| | | | | | | | | | | Whenever the kernel checks for rt_task() to avoid delaying real-time tasks, we want it to also not delay LITMUS^RT tasks. Hence, most calls to rt_task() should be matched by an equivalent call to is_realtime(). Notably, this affects the implementations of select() and nanosleep(), which use timer_slack_ns when setting up timers for non-real-time tasks.
* Hook into fork(), exec(), and exit()Bjoern Brandenburg2014-10-21
| | | | | Allow LITMUS^RT to do some work when a process is created or terminated.
* Call sched_state_task_picked() from pick_next_task_stop()Bjoern Brandenburg2014-10-21
| | | | | Otherwise, the scheduler state machine becomes confused (and goes into a rescheduling loop) when stop-machine is triggered.
* Integrate preemption state machine with Linux schedulerBjoern Brandenburg2014-10-21
| | | | Track when a processor is going to schedule "soon".
* Add LITMUS^RT core implementationBjoern Brandenburg2014-10-21
| | | | | | | | | | | | | This patch adds the core of LITMUS^RT: - library functionality (heaps, rt_domain, prioritization, etc.) - budget enforcement logic - job management - system call backends - virtual devices (control page, etc.) - scheduler plugin API (and dummy plugin) This code compiles, but is not yet integrated with the rest of Linux.
* Add hrtimer_start_on() supportFelipe Cerqueira2014-10-21
| | | | | | This patch adds hrtimer_start_on(), which allows arming timers on remote CPUs. This is needed to avoided timer interrupts on "shielded" CPUs and is also useful for implementing semi-partitioned schedulers.
* Add TRACE() debug tracing supportBjoern Brandenburg2014-10-21
| | | | This patch adds the infrastructure for the TRACE() debug macro.
* Add SCHED, SCHED2, TICK, and CXS overhead tracepointsBjoern Brandenburg2014-10-21
| | | | | | | This patch integrates the overhead tracepoints into the Linux scheduler that are compatible with plain vanilla Linux (i.e., not specific to LITMUS^RT plugins). This can be used to measure the overheads of an otherwise unmodified kernel.
* Integrate ft_irq_fired() with LinuxBjoern Brandenburg2014-10-21
| | | | | This patch hooks up Feather-Trace's ft_irq_fired() handler with Linux's interrupt handling infrastructure.
* ENGR00313685-3 of/irq: simplify args to irq_create_of_mappingGrant Likely2014-05-16
| | | | | | | | | | | | | | | | | | | | | | | commit e6d30ab1e7d1281784672c0fc2ffa385cfb7279e upstream. All the callers of irq_create_of_mapping() pass the contents of a struct of_phandle_args structure to the function. Since all the callers already have an of_phandle_args pointer, why not pass it directly to irq_create_of_mapping()? Signed-off-by: Grant Likely <grant.likely@linaro.org> Acked-by: Michal Simek <monstr@monstr.eu> Acked-by: Tony Lindgren <tony@atomide.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Shawn Guo <shawn.guo@freescale.com> Conflicts: arch/arm/mach-integrator/pci_v3.c arch/mips/pci/pci-rt3883.c kernel/irq/irqdomain.c
* futex: update documentation for ordering guaranteesDavidlohr Bueso2014-04-17
| | | | | | | | | | | | | | Commits 11d4616bd07f ("futex: revert back to the explicit waiter counting code") and 69cd9eba3886 ("futex: avoid race between requeue and wake") changed some of the finer details of how we think about futexes. One was a late fix and the other a consequence of overlooking the whole requeuing logic. The first change caused our documentation to be incorrect, and the second made us aware that we need to explicitly add more details to it. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* futex: avoid race between requeue and wakeLinus Torvalds2014-04-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jan Stancek reported: "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP) occasionally fails, because some threads fail to wake up. Testcase creates 5 threads, which are all waiting on same condition. Main thread then calls pthread_cond_broadcast() without holding mutex, which calls: futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..) This immediately wakes up single thread A, which unlocks mutex and tries to wake up another thread: futex(uaddr2, FUTEX_WAKE_PRIVATE, 1) If thread A manages to call futex_wake() before any waiters are requeued for uaddr2, no other thread is woken up" The ordering constraints for the hash bucket waiter counting are that the waiter counts have to be incremented _before_ getting the spinlock (because the spinlock acts as part of the memory barrier), but the "requeue" operation didn't honor those rules, and nobody had even thought about that case. This fairly simple patch just increments the waiter count for the target hash bucket (hb2) when requeing a futex before taking the locks. It then decrements them again after releasing the lock - the code that actually moves the futex(es) between hash buckets will do the additional required waiter count housekeeping. Reported-and-tested-by: Jan Stancek <jstancek@redhat.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org # 3.14 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* futex: revert back to the explicit waiter counting codeLinus Torvalds2014-04-17
| | | | | | | | | | | | | | | | | | | | | Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid taking the hb->lock if there's nothing to wake up") causes java threads getting stuck on futexes when runing specjbb on a power7 numa box. The cause appears to be that the powerpc spinlocks aren't using the same ticket lock model that we use on x86 (and other) architectures, which in turn result in the "spin_is_locked()" test in hb_waiters_pending() occasionally reporting an unlocked spinlock even when there are pending waiters. So this reinstates Davidlohr Bueso's original explicit waiter counting code, which I had convinced Davidlohr to drop in favor of figuring out the pending waiters by just using the existing state of the spinlock and the wait queue. Reported-and-tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Original-code-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* futex: Allow architectures to skip futex_atomic_cmpxchg_inatomic() testHeiko Carstens2014-04-17
| | | | | | | | | | | | | | If an architecture has futex_atomic_cmpxchg_inatomic() implemented and there is no runtime check necessary, allow to skip the test within futex_init(). This allows to get rid of some code which would always give the same result, and also allows the compiler to optimize a couple of if statements away. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Finn Thain <fthain@telegraphics.com.au> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* futexes: Fix futex_hashsize initializationHeiko Carstens2014-04-17
| | | | | | | | | | | | | | | | | | | | | | | | | "futexes: Increase hash table size for better performance" introduces a new alloc_large_system_hash() call. alloc_large_system_hash() however may allocate less memory than requested, e.g. limited by MAX_ORDER. Hence pass a pointer to alloc_large_system_hash() which will contain the hash shift when the function returns. Afterwards correctly set futex_hashsize. Fixes a crash on s390 where the requested allocation size was 4MB but only 1MB was allocated. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Darren Hart <dvhart@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Waiman Long <Waiman.Long@hp.com> Cc: Jason Low <jason.low2@hp.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris Signed-off-by: Ingo Molnar <mingo@kernel.org>
* futexes: Avoid taking the hb->lock if there's nothing to wake upDavidlohr Bueso2014-04-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In futex_wake() there is clearly no point in taking the hb->lock if we know beforehand that there are no tasks to be woken. While the hash bucket's plist head is a cheap way of knowing this, we cannot rely 100% on it as there is a racy window between the futex_wait call and when the task is actually added to the plist. To this end, we couple it with the spinlock check as tasks trying to enter the critical region are most likely potential waiters that will be added to the plist, thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters. Furthermore, the futex ordering guarantees are preserved, ensuring that waiters either observe the changed user space value before blocking or is woken by a concurrent waker. For wakers, this is done by relying on the barriers in get_futex_key_refs() -- for archs that do not have implicit mb in atomic_inc(), we explicitly add them through a new futex_get_mm function. For waiters we rely on the fact that spin_lock calls already update the head counter, so spinners are visible even if the lock hasn't been acquired yet. For more details please refer to the updated comments in the code and related discussion: https://lkml.org/lkml/2013/11/26/556 Special thanks to tglx for careful review and feedback. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Darren Hart <dvhart@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Scott Norton <scott.norton@hp.com> Cc: Tom Vaden <tom.vaden@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Waiman Long <Waiman.Long@hp.com> Cc: Jason Low <jason.low2@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* futexes: Document multiprocessor ordering guaranteesThomas Gleixner2014-04-17
| | | | | | | | | | | | | | | | | | | | | | That's essential, if you want to hack on futexes. Reviewed-by: Darren Hart <dvhart@linux.intel.com> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Scott Norton <scott.norton@hp.com> Cc: Tom Vaden <tom.vaden@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Waiman Long <Waiman.Long@hp.com> Cc: Jason Low <jason.low2@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* futexes: Increase hash table size for better performanceDavidlohr Bueso2014-04-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the futex global hash table suffers from its fixed, smallish (for today's standards) size of 256 entries, as well as its lack of NUMA awareness. Large systems, using many futexes, can be prone to high amounts of collisions; where these futexes hash to the same bucket and lead to extra contention on the same hb->lock. Furthermore, cacheline bouncing is a reality when we have multiple hb->locks residing on the same cacheline and different futexes hash to adjacent buckets. This patch keeps the current static size of 16 entries for small systems, or otherwise, 256 * ncpus (or larger as we need to round the number to a power of 2). Note that this number of CPUs accounts for all CPUs that can ever be available in the system, taking into consideration things like hotpluging. While we do impose extra overhead at bootup by making the hash table larger, this is a one time thing, and does not shadow the benefits of this patch. Furthermore, as suggested by tglx, by cache aligning the hash buckets we can avoid access across cacheline boundaries and also avoid massive cache line bouncing if multiple cpus are hammering away at different hash buckets which happen to reside in the same cache line. Also, similar to other core kernel components (pid, dcache, tcp), by using alloc_large_system_hash() we benefit from its NUMA awareness and thus the table is distributed among the nodes instead of in a single one. For a custom microbenchmark that pounds on the uaddr hashing -- making the wait path fail at futex_wait_setup() returning -EWOULDBLOCK for large amounts of futexes, we can see the following benefits on a 80-core, 8-socket 1Tb server: +---------+--------------------+------------------------+-----------------------+-------------------------------+ | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) | +---------+--------------------+------------------------+-----------------------+-------------------------------+ |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             | |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             | |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             | |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             | |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             | |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              | +---------+--------------------+------------------------+-----------------------+-------------------------------+ Reviewed-by: Darren Hart <dvhart@linux.intel.com> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Waiman Long <Waiman.Long@hp.com> Reviewed-and-tested-by: Jason Low <jason.low2@hp.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Scott Norton <scott.norton@hp.com> Cc: Tom Vaden <tom.vaden@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* futex: Use freezable blocking callColin Cross2014-04-17
| | | | | | | | | | | | | | | | | | | | | | | | Avoid waking up every thread sleeping in a futex_wait call during suspend and resume by calling a freezable blocking call. Previous patches modified the freezer to avoid sending wakeups to threads that are blocked in freezable blocking calls. This call was selected to be converted to a freezable call because it doesn't hold any locks or release any resources when interrupted that might be needed by another freezing task or a kernel driver during suspend, and is a common site where idle userspace tasks are blocked. Signed-off-by: Colin Cross <ccross@android.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: arve@android.com Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Darren Hart <dvhart@linux.intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/1367458508-9133-8-git-send-email-ccross@android.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* futexes: Clean up various detailsJason Low2014-04-17
| | | | | | | | | | | | | | | | | | | | | | - Remove unnecessary head variables. - Delete unused parameter in queue_unlock(). Reviewed-by: Darren Hart <dvhart@linux.intel.com> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Scott Norton <scott.norton@hp.com> Cc: Tom Vaden <tom.vaden@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Waiman Long <Waiman.Long@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* futex: move user address verification up to common codeLinus Torvalds2014-04-17
| | | | | | | | | | | | | | | | | | | | | | When debugging the read-only hugepage case, I was confused by the fact that get_futex_key() did an access_ok() only for the non-shared futex case, since the user address checking really isn't in any way specific to the private key handling. Now, it turns out that the shared key handling does effectively do the equivalent checks inside get_user_pages_fast() (it doesn't actually check the address range on x86, but does check the page protections for being a user page). So it wasn't actually a bug, but the fact that we treat the address differently for private and shared futexes threw me for a loop. Just move the check up, so that it gets done for both cases. Also, use the 'rw' parameter for the type, even if it doesn't actually matter any more (it's a historical artifact of the old racy i386 "page faults from kernel space don't check write protections"). Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ENGR00273073-1 arm: add cpu idle notification callbackAnson Huang2014-04-16
| | | | | | | | | | | | | | | Some modules may need to know cpu idle status and take actions before and after cpu idle, so we can add notification callback when enter/exit cpu idle, then modules only need to register this notification callback, everytime cpu enter/exit idle, the callback chain will be executed. Currently only cpufreq interactive governor use this notification, as it wants to save power, the timers of interactive governor are only enabled when cpu is not in idle. Signed-off-by: Anson Huang <b20788@freescale.com>
* timekeeping: Avoid possible deadlock from clock_was_set_delayedJohn Stultz2014-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6fdda9a9c5db367130cf32df5d6618d08b89f46a upstream. As part of normal operaions, the hrtimer subsystem frequently calls into the timekeeping code, creating a locking order of hrtimer locks -> timekeeping locks clock_was_set_delayed() was suppoed to allow us to avoid deadlocks between the timekeeping the hrtimer subsystem, so that we could notify the hrtimer subsytem the time had changed while holding the timekeeping locks. This was done by scheduling delayed work that would run later once we were out of the timekeeing code. But unfortunately the lock chains are complex enoguh that in scheduling delayed work, we end up eventually trying to grab an hrtimer lock. Sasha Levin noticed this in testing when the new seqlock lockdep enablement triggered the following (somewhat abrieviated) message: [ 251.100221] ====================================================== [ 251.100221] [ INFO: possible circular locking dependency detected ] [ 251.100221] 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053 Not tainted [ 251.101967] ------------------------------------------------------- [ 251.101967] kworker/10:1/4506 is trying to acquire lock: [ 251.101967] (timekeeper_seq){----..}, at: [<ffffffff81160e96>] retrigger_next_event+0x56/0x70 [ 251.101967] [ 251.101967] but task is already holding lock: [ 251.101967] (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70 [ 251.101967] [ 251.101967] which lock already depends on the new lock. [ 251.101967] [ 251.101967] [ 251.101967] the existing dependency chain (in reverse order) is: [ 251.101967] -> #5 (hrtimer_bases.lock#11){-.-...}: [snipped] -> #4 (&rt_b->rt_runtime_lock){-.-...}: [snipped] -> #3 (&rq->lock){-.-.-.}: [snipped] -> #2 (&p->pi_lock){-.-.-.}: [snipped] -> #1 (&(&pool->lock)->rlock){-.-...}: [ 251.101967] [<ffffffff81194803>] validate_chain+0x6c3/0x7b0 [ 251.101967] [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580 [ 251.101967] [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0 [ 251.101967] [<ffffffff84398500>] _raw_spin_lock+0x40/0x80 [ 251.101967] [<ffffffff81153e69>] __queue_work+0x1a9/0x3f0 [ 251.101967] [<ffffffff81154168>] queue_work_on+0x98/0x120 [ 251.101967] [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30 [ 251.101967] [<ffffffff811c4bd1>] do_adjtimex+0x111/0x160 [ 251.101967] [<ffffffff811e2711>] compat_sys_adjtimex+0x41/0x70 [ 251.101967] [<ffffffff843a4b49>] ia32_sysret+0x0/0x5 [ 251.101967] -> #0 (timekeeper_seq){----..}: [snipped] [ 251.101967] other info that might help us debug this: [ 251.101967] [ 251.101967] Chain exists of: timekeeper_seq --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock#11 [ 251.101967] Possible unsafe locking scenario: [ 251.101967] [ 251.101967] CPU0 CPU1 [ 251.101967] ---- ---- [ 251.101967] lock(hrtimer_bases.lock#11); [ 251.101967] lock(&rt_b->rt_runtime_lock); [ 251.101967] lock(hrtimer_bases.lock#11); [ 251.101967] lock(timekeeper_seq); [ 251.101967] [ 251.101967] *** DEADLOCK *** [ 251.101967] [ 251.101967] 3 locks held by kworker/10:1/4506: [ 251.101967] #0: (events){.+.+.+}, at: [<ffffffff81154960>] process_one_work+0x200/0x530 [ 251.101967] #1: (hrtimer_work){+.+...}, at: [<ffffffff81154960>] process_one_work+0x200/0x530 [ 251.101967] #2: (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70 [ 251.101967] [ 251.101967] stack backtrace: [ 251.101967] CPU: 10 PID: 4506 Comm: kworker/10:1 Not tainted 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053 [ 251.101967] Workqueue: events clock_was_set_work So the best solution is to avoid calling clock_was_set_delayed() while holding the timekeeping lock, and instead using a flag variable to decide if we should call clock_was_set() once we've released the locks. This works for the case here, where the do_adjtimex() was the deadlock trigger point. Unfortuantely, in update_wall_time() we still hold the jiffies lock, which would deadlock with the ipi triggered by clock_was_set(), preventing us from calling it even after we drop the timekeeping lock. So instead call clock_was_set_delayed() at that point. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* timekeeping: Fix missing timekeeping_update in suspend pathJohn Stultz2014-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit 330a1617b0a6268d427aa5922c94d082b1d3e96d upstream. Since 48cdc135d4840 (Implement a shadow timekeeper), we have to call timekeeping_update() after any adjustment to the timekeeping structure in order to make sure that any adjustments to the structure persist. In the timekeeping suspend path, we udpate the timekeeper structure, so we should be sure to update the shadow-timekeeper before releasing the timekeeping locks. Currently this isn't done. In most cases, the next time related code to run would be timekeeping_resume, which does update the shadow-timekeeper, but in an abundence of caution, this patch adds the call to timekeeping_update() in the suspend path. Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* timekeeping: Fix CLOCK_TAI timer/nanosleep delaysJohn Stultz2014-02-13
| | | | | | | | | | | | | | | | | | | | commit 04005f6011e3b504cd4d791d9769f7cb9a3b2eae upstream. A think-o in the calculation of the monotonic -> tai time offset results in CLOCK_TAI timers and nanosleeps to expire late (the latency is ~2x the tai offset). Fix this by adding the tai offset from the realtime offset instead of subtracting. Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* timekeeping: Fix lost updates to tai adjustmentJohn Stultz2014-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | commit f55c07607a38f84b5c7e6066ee1cfe433fa5643c upstream. Since 48cdc135d4840 (Implement a shadow timekeeper), we have to call timekeeping_update() after any adjustment to the timekeeping structure in order to make sure that any adjustments to the structure persist. Unfortunately, the updates to the tai offset via adjtimex do not trigger this update, causing adjustments to the tai offset to be made and then over-written by the previous value at the next update_wall_time() call. This patch resovles the issue by calling timekeeping_update() right after setting the tai offset. Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ftrace: Have function graph only trace based on global_ops filtersSteven Rostedt2014-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 23a8e8441a0a74dd612edf81dc89d1600bc0a3d1 upstream. Doing some different tests, I discovered that function graph tracing, when filtered via the set_ftrace_filter and set_ftrace_notrace files, does not always keep with them if another function ftrace_ops is registered to trace functions. The reason is that function graph just happens to trace all functions that the function tracer enables. When there was only one user of function tracing, the function graph tracer did not need to worry about being called by functions that it did not want to trace. But now that there are other users, this becomes a problem. For example, one just needs to do the following: # cd /sys/kernel/debug/tracing # echo schedule > set_ftrace_filter # echo function_graph > current_tracer # cat trace [..] 0) | schedule() { ------------------------------------------ 0) <idle>-0 => rcu_pre-7 ------------------------------------------ 0) ! 2980.314 us | } 0) | schedule() { ------------------------------------------ 0) rcu_pre-7 => <idle>-0 ------------------------------------------ 0) + 20.701 us | } # echo 1 > /proc/sys/kernel/stack_tracer_enabled # cat trace [..] 1) + 20.825 us | } 1) + 21.651 us | } 1) + 30.924 us | } /* SyS_ioctl */ 1) | do_page_fault() { 1) | __do_page_fault() { 1) 0.274 us | down_read_trylock(); 1) 0.098 us | find_vma(); 1) | handle_mm_fault() { 1) | _raw_spin_lock() { 1) 0.102 us | preempt_count_add(); 1) 0.097 us | do_raw_spin_lock(); 1) 2.173 us | } 1) | do_wp_page() { 1) 0.079 us | vm_normal_page(); 1) 0.086 us | reuse_swap_page(); 1) 0.076 us | page_move_anon_rmap(); 1) | unlock_page() { 1) 0.082 us | page_waitqueue(); 1) 0.086 us | __wake_up_bit(); 1) 1.801 us | } 1) 0.075 us | ptep_set_access_flags(); 1) | _raw_spin_unlock() { 1) 0.098 us | do_raw_spin_unlock(); 1) 0.105 us | preempt_count_sub(); 1) 1.884 us | } 1) 9.149 us | } 1) + 13.083 us | } 1) 0.146 us | up_read(); When the stack tracer was enabled, it enabled all functions to be traced, which now the function graph tracer also traces. This is a side effect that should not occur. To fix this a test is added when the function tracing is changed, as well as when the graph tracer is enabled, to see if anything other than the ftrace global_ops function tracer is enabled. If so, then the graph tracer calls a test trampoline that will look at the function that is being traced and compare it with the filters defined by the global_ops. As an optimization, if there's no other function tracers registered, or if the only registered function tracers also use the global ops, the function graph infrastructure will call the registered function graph callback directly and not go through the test trampoline. Fixes: d2d45c7a03a2 "tracing: Have stack_tracer use a separate list of functions" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ftrace: Fix synchronization location disabling and freeing ftrace_opsSteven Rostedt2014-02-13
| | | | | | | | | | | | | | | | | | | commit a4c35ed241129dd142be4cadb1e5a474a56d5464 upstream. The synchronization needed after ftrace_ops are unregistered must happen after the callback is disabled from becing called by functions. The current location happens after the function is being removed from the internal lists, but not after the function callbacks were disabled, leaving the functions susceptible of being called after their callbacks are freed. This affects perf and any externel users of function tracing (LTTng and SystemTap). Fixes: cdbe61bfe704 "ftrace: Allow dynamically allocated function tracers" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ftrace: Synchronize setting function_trace_op with ftrace_trace_functionSteven Rostedt2014-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit 405e1d834807e51b2ebd3dea81cb51e53fb61504 upstream. ftrace_trace_function is a variable that holds what function will be called directly by the assembly code (mcount). If just a single function is registered and it handles recursion itself, then the assembly will call that function directly without any helper function. It also passes in the ftrace_op that was registered with the callback. The ftrace_op to send is stored in the function_trace_op variable. The ftrace_trace_function and function_trace_op needs to be coordinated such that the called callback wont be called with the wrong ftrace_op, otherwise bad things can happen if it expected a different op. Luckily, there's no callback that doesn't use the helper functions that requires this. But there soon will be and this needs to be fixed. Use a set_function_trace_op to store the ftrace_op to set the function_trace_op to when it is safe to do so (during the update function within the breakpoint or stop machine calls). Or if dynamic ftrace is not being used (static tracing) then we have to do a bit more synchronization when the ftrace_trace_function is set as that takes affect immediately (as oppose to dynamic ftrace doing it with the modification of the trampoline). Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* audit: reset audit backlog wait time after error recoveryRichard Guy Briggs2014-02-13
| | | | | | | | | | | | | | | | | | | commit e789e561a50de0aaa8c695662d97aaa5eac9d55f upstream. When the audit queue overflows and times out (audit_backlog_wait_time), the audit queue overflow timeout is set to zero. Once the audit queue overflow timeout condition recovers, the timeout should be reset to the original value. See also: https://lkml.org/lkml/2013/9/2/473 Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Dan Duval <dan.duval@oracle.com> Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Check if tracing is enabled in trace_puts()Steven Rostedt (Red Hat)2014-02-13
| | | | | | | | | | | | | | commit 3132e107d608f8753240d82d61303c500fd515b4 upstream. If trace_puts() is used very early in boot up, it can crash the machine if it is called before the ring buffer is allocated. If a trace_printk() is used with no arguments, then it will be converted into a trace_puts() and suffer the same fate. Fixes: 09ae72348ecc "tracing: Add trace_puts() for even faster trace_printk() tracing" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Have trace buffer point back to trace_arraySteven Rostedt (Red Hat)2014-02-13
| | | | | | | | | | | | | | | | | commit dced341b2d4f06668efaab33f88de5d287c0f45b upstream. The trace buffer has a descriptor pointer that goes back to the trace array. But it was never assigned. Luckily, nothing uses it (yet), but it will in the future. Although nothing currently uses this, if any of the new features get backported to older kernels, and because this is such a simple change, I'm marking it for stable too. Fixes: 12883efb670c "tracing: Consolidate max_tr into main trace_array structure" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched: Guarantee new group-entities always have weightPaul Turner2014-01-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 0ac9b1c21874d2490331233b3242085f8151e166 upstream. Currently, group entity load-weights are initialized to zero. This admits some races with respect to the first time they are re-weighted in earlty use. ( Let g[x] denote the se for "g" on cpu "x". ) Suppose that we have root->a and that a enters a throttled state, immediately followed by a[0]->t1 (the only task running on cpu[0]) blocking: put_prev_task(group_cfs_rq(a[0]), t1) put_prev_entity(..., t1) check_cfs_rq_runtime(group_cfs_rq(a[0])) throttle_cfs_rq(group_cfs_rq(a[0])) Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first time: enqueue_task_fair(rq[0], t2) enqueue_entity(group_cfs_rq(b[0]), t2) enqueue_entity_load_avg(group_cfs_rq(b[0]), t2) account_entity_enqueue(group_cfs_ra(b[0]), t2) update_cfs_shares(group_cfs_rq(b[0])) < skipped because b is part of a throttled hierarchy > enqueue_entity(group_cfs_rq(a[0]), b[0]) ... We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0 which violates invariants in several code-paths. Eliminate the possibility of this by initializing group entity weight. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>