aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/sched
Commit message (Collapse)AuthorAge
* Core: call IPI notifier from generic scheduler IPI handlerBjoern Brandenburg2017-06-15
| | | | | Instead of replicating this code across all supported architectures, hook into the IPI code only once.
* LITMUS^RT integration in core scheduler: include debug_trace.hBjoern Brandenburg2017-06-09
|
* Protect LITMUS^RT tasks from re-nicingBjoern Brandenburg2017-05-26
| | | | | Assigning a nice value to LITMUS^RT tasks is meaningless. Bail out early.
* Don't call set_tsk_need_resched() on remote LITMUS^RT taskBjoern Brandenburg2017-05-26
| | | | This patch fixes a BUG_ON() in litmus/preempt.c.
* Hook into SCHED_DEADLINE to protect LITMUS^RT tasksBjoern Brandenburg2017-05-26
| | | | | SCHED_DEADLINE should not preempt LITMUS^RT tasks, as the LITMUS^RT scheduling class is positioned above the SCHED_DEADLINE policy.
* Hook into rt scheduling class to protect LITMUS^RT tasksBjoern Brandenburg2017-05-26
| | | | | | The rt scheduling class thinks it's the highest-priority scheduling class around, next to SCHED_DEADLINE. It is not in LITMUS^RT. Don't go preempting remote cores that run SCHED_LITMUS tasks.
* Don't trigger load balancer in scheduler tick for LITMUS^RTBjoern Brandenburg2017-05-26
|
* Hook into finish_switch()Bjoern Brandenburg2017-05-26
| | | | To keep track of stack usage and to notify plugin, if necessary.
* Reset SCHED_LITMUS scheduling class on forkBjoern Brandenburg2017-05-26
|
* Block sched_setaffinity() for SCHED_LITMUS tasksBjoern Brandenburg2017-05-26
|
* Disable cut-to-CFS optimization in Linux schedulerBjoern Brandenburg2017-05-26
| | | | | Global plugins require that the plugin be called even if there currently is no real-time task executing on the local core.
* Hook into fork(), exec(), and exit()Bjoern Brandenburg2017-05-26
| | | | | Allow LITMUS^RT to do some work when a process is created or terminated.
* Integrate LITMUS^RT scheduling class with sched_setschedulerBjoern Brandenburg2017-05-26
|
* Integrate LITMUS^RT with try_to_wake_up() pathBjoern Brandenburg2017-05-26
|
* Make LITMUS^RT scheduling class the highest-priority scheduling classBjoern Brandenburg2017-05-26
| | | | | | Needs to be above stop_machine_class for legacy reasons; the main plugins were developed before stop_machine_class was introduced and assume that they are the highest-priority scheduling class.
* Add LITMUS^RT scheduling class in kernel/sched/MakefileBjoern Brandenburg2017-05-26
|
* Introduce LITMUS^RT runqueue dummy into struct rqBjoern Brandenburg2017-05-26
|
* Hookup sched_trace_XXX() tracing in Linux schedulerBjoern Brandenburg2017-05-26
| | | | This patch adds context switch tracing to the main Linux scheduler.
* Hook into __schedule() to set litmus_preemption_in_progressBjoern Brandenburg2017-05-26
|
* Call sched_state_task_picked() from pick_next_task_stop()Bjoern Brandenburg2017-05-26
| | | | | Otherwise, the scheduler state machine becomes confused (and goes into a rescheduling loop) when stop-machine is triggered.
* Integrate preemption state machine with Linux schedulerBjoern Brandenburg2017-05-26
| | | | Track when a processor is going to schedule "soon".
* Add LITMUS^RT core implementationBjoern Brandenburg2017-05-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the core of LITMUS^RT: - library functionality (heaps, rt_domain, prioritization, etc.) - budget enforcement logic - job management - system call backends - virtual devices (control page, etc.) - scheduler plugin API (and dummy plugin) This code compiles, but is not yet integrated with the rest of Linux. Squashed changes: LITMUS^RT Core: add get_current_budget() system call Allow userspace to figure out the used-up and remaining budget of a task. Adds deadline field to control page and updates it when setting up jobs for release. Adds control page deadline offset ftdev: respect O_NONBLOCK flag in ftdev_read() Don't block if userspace wants to go on doing something else. Export job release time and job sequence number in ctrl page Add alternate complete_job() default implementation Let jobs sleep like regular Linux tasks by suspending and waking them with a one-shot timer. Plugins can opt into using this implementation instead of the classic complete_job() implementation (or custom implementations). Fix RCU locking in sys_get_rt_task_param() sys_get_rt_task_param() is rarely used and apparently attracted some bitrot. Free before setting NULL to prevent memory leak Add hrtimer_start_on() support This patch replaces the previous implementation of hrtimer_start_on() by now using smp_call_function_single_async() to arm hrtimers on remote CPUs. Expose LITMUS^RT system calls via control page ioctl() Rationale: make LITMUS^RT ops available in a way that does not create merge conflicts each time we rebase LITMUS^RT on top of a new kernel version. This also helps with portability to different architectures, as we no longer need to patch each architecture's syscall table. Pick non-zero syscall ID start range To avoid interfering with Linux's magic reserved IOCTL numbers Don't preempt before time check in sleep_until_next_release() Avoid preempting jobs that are about to go to sleep soon anyway. LITMUS^RT proc: fix wrong memset() TRACE(): add TRACE_WARN_ON() helper Useful to replace BUG_ON() and WARN_ON() with a non-fatal TRACE()-based equivalent. Add void* plugin_state pointer to task_struct LITMUS^RT: split task admission into two functions Plugin interface: add fork_task() callback LITMUS^RT: Enable plugins to permit RT tasks to fork one-shot complete_job(): set completed flag This could race with a SIGSTOP or some other forced suspension, but we'll let plugins handle this, should they actually care. FP: add list-based ready queue LITMUS^RT core: add should_wait_for_stack() callback Allow plugins to give up when waiting for a stack to become available. LITMUS^RT core: add next_became_invalid() callback LITMUS^RT core: add post-migration validation callback LITMUS^RT core: be more careful when pull-migrating tasks Close more race windows and give plugins a chance to validate tasks after they have been migrated. Add KConfig options for timer latency warnings Add reservation creation API to plugin interface & syscalls LITMUS^RT syscall: expose sys_reservation_create() via ioctl() Add reservation configuration types to rt_param.h Add basic generic reservation-based scheduling infrastructure Switch to aligned quanta by default. For first-time users, aligned quanta is likely what's expected. LITMUS^RT core: keep track of time of last suspension This information is needed to insert ST_COMPLETION records for sporadic tasks. add fields for clock_nanosleep() support Need to communicate the intended wake-up time to the plugin wake-up handler. LITMUS^RT core: add generic handler for sporadic job arrivals In particular, check if a job arrival is triggered from a clock_nanosleep() call. add litmus->task_change_params() callback to plugin interface Will be used by adaptive C-EDF. Call litmus->task_change_params() from sys_set_rt_task_param() Move trace point definition to litmus/litmus.c If !CONFIG_SCHED_TASK_TRACE, but CONFIG_SCHED_LITMUS_TRACEPOINT, then we still need to define the tracepoint structures. This patch should be integrated with the earlier sched_task_trace.c patches during one of the next major rebasing efforts. LITMUS^RT scheduling class: mark enqueued task as present Remove unistd_*.h rebase fix: update to new hrtimer API The new API is actually nicer and cleaner. rebase fix: call lockdep_unpin_lock(&rq->lock, cookie) The LITMUS^RT scheduling class should also do the LOCKDEP dance. LITMUS^RT core: break out non-preemptive flag defs Not every file including litmus.h needs to know this. LITMUS^RT core: don't include debug_trace.h in litmus.h Including debug_trace.h introduces the TRACE() macro, which causes symbol clashes in some (rather obscure) drivers. LITMUS^RT core: add litmus_preemption_in_progress flags Used to communicate that a preemption is in progress. Set by the scheduler; read by the plugins. LITMUS^RT core: revise is_current_running() macro
* Add SCHED, SCHED2, and CXS overhead tracepointsBjoern Brandenburg2017-05-26
| | | | | | | This patch integrates the overhead tracepoints into the Linux scheduler that are compatible with plain vanilla Linux (i.e., not specific to LITMUS^RT plugins). This can be used to measure the overheads of an otherwise unmodified kernel.
* Integrate ft_irq_fired() with LinuxBjoern Brandenburg2017-05-26
| | | | | | | | | | | | | | | | This patch hooks up Feather-Trace's ft_irq_fired() handler with Linux's interrupt handling infrastructure. IRQ tracing: ft_irq_fired() only if irq_enter() was skipped On x86, the rescheduling IPI path already calls irq_enter(), which calls ft_irq_fired(), so we don't have to do it again. IRQ tracing: don't count softirqs Only triggered by irq_exit(), which implies that we called irq_enter(), which means that we already traced the current hard IRQ.
* sched/rt: Add a missing rescheduling pointSebastian Andrzej Siewior2017-03-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 619bd4a71874a8fd78eb6ccf9f272c5e98bcc7b7 upstream. Since the change in commit: fd7a4bed1835 ("sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks") ... we don't reschedule a task under certain circumstances: Lets say task-A, SCHED_OTHER, is running on CPU0 (and it may run only on CPU0) and holds a PI lock. This task is removed from the CPU because it used up its time slice and another SCHED_OTHER task is running. Task-B on CPU1 runs at RT priority and asks for the lock owned by task-A. This results in a priority boost for task-A. Task-B goes to sleep until the lock has been made available. Task-A is already runnable (but not active), so it receives no wake up. The reality now is that task-A gets on the CPU once the scheduler decides to remove the current task despite the fact that a high priority task is enqueued and waiting. This may take a long time. The desired behaviour is that CPU0 immediately reschedules after the priority boost which made task-A the task with the lowest priority. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: fd7a4bed1835 ("sched, rt: Convert switched_{from, to}_rt() prio_changed_rt() to balance callbacks") Link: http://lkml.kernel.org/r/20170124144006.29821-1-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched/autogroup: Fix 64-bit kernel nice level adjustmentMike Galbraith2016-11-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Michael Kerrisk reported: > Regarding the previous paragraph... My tests indicate > that writing *any* value to the autogroup [nice priority level] > file causes the task group to get a lower priority. Because autogroup didn't call the then meaningless scale_load()... Autogroup nice level adjustment has been broken ever since load resolution was increased for 64-bit kernels. Use scale_load() to scale group weight. Michael Kerrisk tested this patch to fix the problem: > Applied and tested against 4.9-rc6 on an Intel u7 (4 cores). > Test setup: > > Terminal window 1: running 40 CPU burner jobs > Terminal window 2: running 40 CPU burner jobs > Terminal window 1: running 1 CPU burner job > > Demonstrated that: > * Writing "0" to the autogroup file for TW1 now causes no change > to the rate at which the process on the terminal consume CPU. > * Writing -20 to the autogroup file for TW1 caused those processes > to get the lion's share of CPU while TW2 TW3 get a tiny amount. > * Writing -20 to the autogroup files for TW1 and TW3 allowed the > process on TW3 to get as much CPU as it was getting as when > the autogroup nice values for both terminals were 0. Reported-by: Michael Kerrisk <mtk.manpages@gmail.com> Tested-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-man <linux-man@vger.kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1479897217.4306.6.camel@gmx.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/autogroup: Do not use autogroup->tg in zombie threadsOleg Nesterov2016-11-22
| | | | | | | | | | | | | | | | | | | | | Exactly because for_each_thread() in autogroup_move_group() can't see it and update its ->sched_task_group before _put() and possibly free(). So the exiting task needs another sched_move_task() before exit_notify() and we need to re-introduce the PF_EXITING (or similar) check removed by the previous change for another reason. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: hartsjc@redhat.com Cc: vbendel@redhat.com Cc: vlovejoy@redhat.com Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task()Oleg Nesterov2016-11-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove it, but see the next patch. However the comment is correct in that autogroup_move_group() must always change task_group() for every thread so the sysctl_ check is very wrong; we can race with cgroups and even sys_setsid() is not safe because a task running with task_group() == ag->tg must participate in refcounting: int main(void) { int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY); assert(sctl > 0); if (fork()) { wait(NULL); // destroy the child's ag/tg pause(); } assert(pwrite(sctl, "1\n", 2, 0) == 2); assert(setsid() > 0); if (fork()) pause(); kill(getppid(), SIGKILL); sleep(1); // The child has gone, the grandchild runs with kref == 1 assert(pwrite(sctl, "0\n", 2, 0) == 2); assert(setsid() > 0); // runs with the freed ag/tg for (;;) sleep(1); return 0; } crashes the kernel. It doesn't really need sleep(1), it doesn't matter if autogroup_move_group() actually frees the task_group or this happens later. Reported-by: Vern Lovejoy <vlovejoy@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: hartsjc@redhat.com Cc: vbendel@redhat.com Link: http://lkml.kernel.org/r/20161114184609.GA15965@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/core: Remove pointless printout in sched_show_task()Linus Torvalds2016-11-03
| | | | | | | | | | | | | | | | | | | | In sched_show_task() we print out a useless hex number, not even a symbol, and there's a big question mark whether this even makes sense anyway, I suspect we should just remove it all. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: brgerst@gmail.com Cc: jann@thejh.net Cc: keescook@chromium.org Cc: linux-api@vger.kernel.org Cc: tycho.andersen@canonical.com Link: http://lkml.kernel.org/r/CA+55aFzphURPFzAvU4z6Moy7ZmimcwPuUdYU8bj9z0J+S8X1rw@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/core: Fix oops in sched_show_task()Tetsuo Handa2016-11-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread remains in the task list after its stack pointer was already set to NULL. Therefore, thread_saved_pc() and stack_not_used() in sched_show_task() will trigger NULL pointer dereference if an attempt to dump such thread's traces (e.g. SysRq-t, khungtaskd) is made. Since show_stack() in sched_show_task() calls try_get_task_stack() and sched_show_task() is called from interrupt context, calling try_get_task_stack() from sched_show_task() will be safe as well. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Andy Lutomirski <luto@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: brgerst@gmail.com Cc: jann@thejh.net Cc: keescook@chromium.org Cc: linux-api@vger.kernel.org Cc: tycho.andersen@canonical.com Link: http://lkml.kernel.org/r/201611021950.FEJ34368.HFFJOOMLtQOVSF@I-love.SAKURA.ne.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
*-. Merge branches 'core-urgent-for-linus', 'irq-urgent-for-linus' and ↵Linus Torvalds2016-10-28
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool, irq and scheduler fixes from Ingo Molnar: "One more objtool fixlet for GCC6 code generation patterns, an irq DocBook fix and an unused variable warning fix in the scheduler" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Fix rare switch jump table pattern detection * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: doc: Add missing parameter for msi_setup * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Remove unused but set variable 'rq'
| | * sched/fair: Remove unused but set variable 'rq'Tobias Klauser2016-10-27
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit: 8663e24d56dc ("sched/fair: Reorder cgroup creation code") ... the variable 'rq' in alloc_fair_sched_group() is set but no longer used. Remove it to fix the following GCC warning when building with 'W=1': kernel/sched/fair.c:8842:13: warning: variable ‘rq’ set but not used [-Wunused-but-set-variable] Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20161026113704.8981-1-tklauser@distanz.ch Signed-off-by: Ingo Molnar <mingo@kernel.org>
* / mm: remove per-zone hashtable of bitlock waitqueuesLinus Torvalds2016-10-27
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The per-zone waitqueues exist because of a scalability issue with the page waitqueues on some NUMA machines, but it turns out that they hurt normal loads, and now with the vmalloced stacks they also end up breaking gfs2 that uses a bit_wait on a stack object: wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE) where 'gh' can be a reference to the local variable 'mount_gh' on the stack of fill_super(). The reason the per-zone hash table breaks for this case is that there is no "zone" for virtual allocations, and trying to look up the physical page to get at it will fail (with a BUG_ON()). It turns out that I actually complained to the mm people about the per-zone hash table for another reason just a month ago: the zone lookup also hurts the regular use of "unlock_page()" a lot, because the zone lookup ends up forcing several unnecessary cache misses and generates horrible code. As part of that earlier discussion, we had a much better solution for the NUMA scalability issue - by just making the page lock have a separate contention bit, the waitqueue doesn't even have to be looked at for the normal case. Peter Zijlstra already has a patch for that, but let's see if anybody even notices. In the meantime, let's fix the actual gfs2 breakage by simplifying the bitlock waitqueues and removing the per-zone issue. Reported-by: Andreas Gruenbacher <agruenba@redhat.com> Tested-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sched/fair: Fix incorrect task group ->load_avgVincent Guittot2016-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A scheduler performance regression has been reported by Joseph Salisbury, which he bisected back to: 3d30544f0212 ("sched/fair: Apply more PELT fixes) The regression triggers when several levels of task groups are involved (read: SystemD) and cpu_possible_mask != cpu_present_mask. The root cause is that group entity's load (tg_child->se[i]->avg.load_avg) is initialized to scale_load_down(se->load.weight). During the creation of a child task group, its group entities on possible CPUs are attached to parent's cfs_rq (tg_parent) and their loads are added to the parent's load (tg_parent->load_avg) with update_tg_load_avg(). But only the load on online CPUs will then be updated to reflect real load, whereas load on other CPUs will stay at the initial value. The result is a tg_parent->load_avg that is higher than the real load, the weight of group entities (tg_parent->se[i]->load.weight) on online CPUs is smaller than it should be, and the task group gets a less running time than what it could expect. ( This situation can be detected with /proc/sched_debug. The ".tg_load_avg" of the task group will be much higher than sum of ".tg_load_avg_contrib" of online cfs_rqs of the task group. ) The load of group entities don't have to be intialized to something else than 0 because their load will increase when an entity is attached. Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@vger.kernel.org> # 4.8.x Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: joonwoop@codeaurora.org Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes) Link: http://lkml.kernel.org/r/1476881123-10159-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds2016-10-18
|\ | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a crash that can trigger when racing with CPU hotplug: we didn't use sched-domains data structures carefully enough in select_idle_cpu()" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix sched domains NULL dereference in select_idle_sibling()
| * sched/fair: Fix sched domains NULL dereference in select_idle_sibling()Wanpeng Li2016-10-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit: 10e2f1acd01 ("sched/core: Rewrite and improve select_idle_siblings()") ... improved select_idle_sibling(), but also triggered a regression (crash) during CPU-hotplug: BUG: unable to handle kernel NULL pointer dereference at 0000000000000078 IP: [<ffffffffb10cd332>] select_idle_sibling+0x1c2/0x4f0 Call Trace: <IRQ> select_task_rq_fair+0x749/0x930 ? select_task_rq_fair+0xb4/0x930 ? __lock_is_held+0x54/0x70 try_to_wake_up+0x19a/0x5b0 default_wake_function+0x12/0x20 autoremove_wake_function+0x12/0x40 __wake_up_common+0x55/0x90 __wake_up+0x39/0x50 wake_up_klogd_work_func+0x40/0x60 irq_work_run_list+0x57/0x80 irq_work_run+0x2c/0x30 smp_irq_work_interrupt+0x2e/0x40 irq_work_interrupt+0x96/0xa0 <EOI> ? _raw_spin_unlock_irqrestore+0x45/0x80 try_to_wake_up+0x4a/0x5b0 wake_up_state+0x10/0x20 __kthread_unpark+0x67/0x70 kthread_unpark+0x22/0x30 cpuhp_online_idle+0x3e/0x70 cpu_startup_entry+0x6a/0x450 start_secondary+0x154/0x180 This can be reproduced by running the ftrace test case of kselftest, the test case will hot-unplug the CPU and the CPU will attach to the NULL sched-domain during scheduler teardown. The step 2 for the rewrite select_idle_siblings(): | Step 2) tracks the average cost of the scan and compares this to the | average idle time guestimate for the CPU doing the wakeup. If the CPU which doing the wakeup is the going hot-unplug CPU, then NULL sched domain will be dereferenced to acquire the average cost of the scan. This patch fix it by failing the search of an idle CPU in the LLC process if this sched domain is NULL. Tested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1475971443-3187-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | Merge tag 'gcc-plugins-v4.9-rc1' of ↵Linus Torvalds2016-10-15
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull gcc plugins update from Kees Cook: "This adds a new gcc plugin named "latent_entropy". It is designed to extract as much possible uncertainty from a running system at boot time as possible, hoping to capitalize on any possible variation in CPU operation (due to runtime data differences, hardware differences, SMP ordering, thermal timing variation, cache behavior, etc). At the very least, this plugin is a much more comprehensive example for how to manipulate kernel code using the gcc plugin internals" * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: latent_entropy: Mark functions with __latent_entropy gcc-plugins: Add latent_entropy plugin
| * | latent_entropy: Mark functions with __latent_entropyEmese Revfy2016-10-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The __latent_entropy gcc attribute can be used only on functions and variables. If it is on a function then the plugin will instrument it for gathering control-flow entropy. If the attribute is on a variable then the plugin will initialize it with random contents. The variable must be an integer, an integer array type or a structure with integer fields. These specific functions have been selected because they are init functions (to help gather boot-time entropy), are called at unpredictable times, or they have variable loops, each of which provide some level of latent entropy. Signed-off-by: Emese Revfy <re.emese@gmail.com> [kees: expanded commit message] Signed-off-by: Kees Cook <keescook@chromium.org>
* | | Merge branch 'for-4.9' of ↵Linus Torvalds2016-10-14
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - tracepoints for basic cgroup management operations added - kernfs and cgroup path formatting functions updated to behave in the style of strlcpy() - non-critical bug fixes * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent() cpuset: fix error handling regression in proc_cpuset_show() cgroup: add tracepoints for basic operations cgroup: make cgroup_path() and friends behave in the style of strlcpy() kernfs: remove kernfs_path_len() kernfs: make kernfs_path*() behave in the style of strlcpy() kernfs: add dummy implementation of kernfs_path_from_node()
| * | | cgroup: make cgroup_path() and friends behave in the style of strlcpy()Tejun Heo2016-08-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_path() and friends used to format the path from the end and thus the resulting path usually didn't start at the start of the passed in buffer. Also, when the buffer was too small, the partial result was truncated from the head rather than tail and there was no way to tell how long the full path would be. These make the functions less robust and more awkward to use. With recent updates to kernfs_path(), cgroup_path() and friends can be made to behave in strlcpy() style. * cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now always return the length of the full path. If buffer is too small, it contains nul terminated truncated output. * All users updated accordingly. v2: cgroup_path() usage in kernel/sched/debug.c converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Peter Zijlstra <peterz@infradead.org>
* | | | nmi_backtrace: generate one-line reports for idle cpusChris Metcalf2016-10-07
| |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing an nmi backtrace of many cores, most of which are idle, the output is a little overwhelming and very uninformative. Suppress messages for cpus that are idling when they are interrupted and just emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN". We do this by grouping all the cpuidle code together into a new .cpuidle.text section, and then checking the address of the interrupted PC to see if it lies within that section. This commit suitably tags x86 and tile idle routines, and only adds in the minimal framework for other architectures. Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.com Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm] Tested-by: Petr Mladek <pmladek@suse.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'x86-asm-for-linus' of ↵Linus Torvalds2016-10-03
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull low-level x86 updates from Ingo Molnar: "In this cycle this topic tree has become one of those 'super topics' that accumulated a lot of changes: - Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on x86 - preceded by an array of changes. v4.8 saw preparatory changes in this area already - this is the rest of the work. Includes the thread stack caching performance optimization. (Andy Lutomirski) - switch_to() cleanups and all around enhancements. (Brian Gerst) - A large number of dumpstack infrastructure enhancements and an unwinder abstraction. The secret long term plan is safe(r) live patching plus maybe another attempt at debuginfo based unwinding - but all these current bits are standalone enhancements in a frame pointer based debug environment as well. (Josh Poimboeuf) - More __ro_after_init and const annotations. (Kees Cook) - Enable KASLR for the vmemmap memory region. (Thomas Garnier)" [ The virtually mapped stack changes are pretty fundamental, and not x86-specific per se, even if they are only used on x86 right now. ] * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits) x86/asm: Get rid of __read_cr4_safe() thread_info: Use unsigned long for flags x86/alternatives: Add stack frame dependency to alternative_call_2() x86/dumpstack: Fix show_stack() task pointer regression x86/dumpstack: Remove dump_trace() and related callbacks x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder oprofile/x86: Convert x86_backtrace() to use the new unwinder x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder perf/x86: Convert perf_callchain_kernel() to use the new unwinder x86/unwind: Add new unwind interface and implementations x86/dumpstack: Remove NULL task pointer convention fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK lib/syscall: Pin the task stack in collect_syscall() x86/process: Pin the target stack in get_wchan() x86/dumpstack: Pin the target stack when dumping it kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function sched/core: Add try_get_task_stack() and put_task_stack() x86/entry/64: Fix a minor comment rebase error iommu/amd: Don't put completion-wait semaphore on stack ...
| * | | sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASKAndy Lutomirski2016-09-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently keep every task's stack around until the task_struct itself is freed. This means that we keep the stack allocation alive for longer than necessary and that, under load, we free stacks in big batches whenever RCU drops the last task reference. Neither of these is good for reuse of cache-hot memory, and freeing in batches prevents us from usefully caching small numbers of vmalloced stacks. On architectures that have thread_info on the stack, we can't easily change this, but on architectures that set THREAD_INFO_IN_TASK, we can free it as soon as the task is dead. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jann Horn <jann@thejh.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/08ca06cde00ebed0046c5d26cbbf3fbb7ef5b812.1474003868.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/core: Allow putting thread_info into task_structAndy Lutomirski2016-09-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK_STRUCT, then thread_info is defined as a single 'u32 flags' and is the first entry of task_struct. thread_info::task is removed (it serves no purpose if thread_info is embedded in task_struct), and thread_info::cpu gets its own slot in task_struct. This is heavily based on a patch written by Linus. Originally-from: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jann Horn <jann@thejh.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a0898196f0476195ca02713691a5037a14f2aac5.1473801993.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | Merge branch 'linus' into x86/asm, to pick up recent fixesIngo Molnar2016-09-15
| |\ \ \ | | | |/ | | |/| | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched: Remove __schedule() non-standard frame annotationBrian Gerst2016-08-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that the x86 switch_to() uses the standard C calling convention, the STACK_FRAME_NON_STANDARD() annotation is no longer needed. Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Brian Gerst <brgerst@gmail.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1471106302-10159-8-git-send-email-brgerst@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2016-10-03
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "The main changes are: - irqtime accounting cleanups and enhancements. (Frederic Weisbecker) - schedstat debugging enhancements, make it more broadly runtime available. (Josh Poimboeuf) - More work on asymmetric topology/capacity scheduling. (Morten Rasmussen) - sched/wait fixes and cleanups. (Oleg Nesterov) - PELT (per entity load tracking) improvements. (Peter Zijlstra) - Rewrite and enhance select_idle_siblings(). (Peter Zijlstra) - sched/numa enhancements/fixes (Rik van Riel) - sched/cputime scalability improvements (Stanislaw Gruszka) - Load calculation arithmetics fixes. (Dietmar Eggemann) - sched/deadline enhancements (Tommaso Cucinotta) - Fix utilization accounting when switching to the SCHED_NORMAL policy. (Vincent Guittot) - ... plus misc cleanups and enhancements" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) sched/irqtime: Consolidate irqtime flushing code sched/irqtime: Consolidate accounting synchronization with u64_stats API u64_stats: Introduce IRQs disabled helpers sched/irqtime: Remove needless IRQs disablement on kcpustat update sched/irqtime: No need for preempt-safe accessors sched/fair: Fix min_vruntime tracking sched/debug: Add SCHED_WARN_ON() sched/core: Fix set_user_nice() sched/fair: Introduce set_curr_task() helper sched/core, ia64: Rename set_curr_task() sched/core: Fix incorrect utilization accounting when switching to fair class sched/core: Optimize SCHED_SMT sched/core: Rewrite and improve select_idle_siblings() sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared sched/core: Introduce 'struct sched_domain_shared' sched/core: Restructure destroy_sched_domain() sched/core: Remove unused @cpu argument from destroy_sched_domain*() sched/wait: Introduce init_wait_entry() sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock() sched/wait: Avoid abort_exclusive_wait() in ___wait_event() ...
| * | | | sched/irqtime: Consolidate irqtime flushing codeFrederic Weisbecker2016-09-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code performing irqtime nsecs stats flushing to kcpustat is roughly the same for hardirq and softirq. So lets consolidate that common code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1474849761-12678-6-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched/irqtime: Consolidate accounting synchronization with u64_stats APIFrederic Weisbecker2016-09-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The irqtime accounting currently implement its own ad hoc implementation of u64_stats API. Lets rather consolidate it with the appropriate library. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1474849761-12678-5-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched/irqtime: Remove needless IRQs disablement on kcpustat updateFrederic Weisbecker2016-09-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The callers of the functions performing irqtime kcpustat updates have IRQS disabled, no need to disable them again. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1474849761-12678-3-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>