author     Linus Torvalds <torvalds@linux-foundation.org>  2013-05-05 16:23:27 -0400
committer  Linus Torvalds <torvalds@linux-foundation.org>  2013-05-05 16:23:27 -0400
commit     534c97b0950b1967bca1c753aeaed32f5db40264 (patch)
tree       9421d26e4f6d479d1bc32b036a731b065daab0fa /kernel
parent     64049d1973c1735f543eb7a55653e291e108b0cb (diff)
parent     265f22a975c1e4cc3a4d1f94a3ec53ffbb6f5b9f (diff)
Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull 'full dynticks' support from Ingo Molnar:
 "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
  kernel feature to the timer and scheduler subsystems: 'full dynticks',
  or CONFIG_NO_HZ_FULL=y.

  This feature extends the nohz variable-size timer tick feature from
  idle to busy CPUs (running at most one task) as well, potentially
  reducing the number of timer interrupts significantly.

  This feature got motivated by real-time folks and the -rt tree, but
  the general utility and motivation of full-dynticks runs wider than
  that:

   - HPC workloads get faster: CPUs running a single task should be
     able to utilize a maximum amount of CPU power. A periodic timer
     tick at HZ=1000 can cause a constant overhead of up to 1.0%. This
     feature removes that overhead - and speeds up the system by
     0.5%-1.0% on typical distro configs even on modern systems.

   - Real-time workload latency reduction: CPUs running critical tasks
     should experience as little jitter as possible. The last remaining
     source of kernel-related jitter was the periodic timer tick.

   - A single task executing on a CPU is a pretty common situation,
     especially with an increasing number of cores/CPUs, so this
     feature helps desktop and mobile workloads as well.

  The cost of the feature is mainly related to increased timer
  reprogramming overhead when a CPU switches its tick period, and thus
  slightly longer to-idle and from-idle latency.

  Configuration-wise a third mode of operation is added to the existing
  two NOHZ kconfig modes:

   - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
     as a config option. This is the traditional Linux periodic tick
     design: there's a HZ tick going on all the time, regardless of
     whether a CPU is idle or not.

   - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
     periodic tick when a CPU enters idle mode.

   - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
     tick when a CPU is idle, also slows the tick down to 1 Hz (one
     timer interrupt per second) when only a single task is running on
     a CPU.

  The .config behavior is compatible: existing !CONFIG_NO_HZ and
  CONFIG_NO_HZ=y settings get translated to the new values, without the
  user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
  default.

  This feature is based on a lot of infrastructure work that has been
  steadily going upstream in the last 2-3 cycles: related RCU support
  and non-periodic cputime support in particular is upstream already.

  This tree adds the final pieces and activates the feature. The pull
  request is marked RFC because:

   - it's marked 64-bit only at the moment - the 32-bit support patch
     is small but did not get ready in time.

   - it has a number of fresh commits that came in after the merge
     window. The overwhelming majority of commits are from before the
     merge window, but still some aspects of the tree are fresh and so
     I marked it RFC.

   - it's a pretty wide-reaching feature with lots of effects - and
     while the components have been in testing for some time, the full
     combination is still not very widely used. That it's default-off
     should reduce its regression abilities and obviously there are no
     known regressions with CONFIG_NO_HZ_FULL=y enabled either.

   - the feature is not completely idempotent: there is no 100%
     equivalent replacement for a periodic scheduler/timer tick. In
     particular there's ongoing work to map out and reduce its effects
     on scheduler load-balancing and statistics. This should not impact
     correctness though, there are no known regressions related to this
     feature at this point.

   - it's a pretty ambitious feature that with time will likely be
     enabled by most Linux distros, and we'd like your input on its
     design/implementation, if you dislike some aspect we missed.
     Without flaming us to a crisp! :-)

  Future plans:

   - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
     the periodic tick altogether when there's a single busy task on a
     CPU. We'd first like 1 Hz to be exposed more widely before we go
     for the 0 Hz target though.

   - once we reach 0 Hz we can remove the periodic tick assumption from
     nr_running>=2 as well, by essentially interrupting busy tasks only
     as frequently as the sched_latency constraints require us to do -
     once every 4-40 msecs, depending on nr_running.

  I am personally leaning towards biting the bullet and doing this in
  v3.10, like the -rt tree this effort has been going on for too long -
  but the final word is up to you as usual.

  More technical details can be found in Documentation/timers/NO_HZ.txt"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  sched: Keep at least 1 tick per second for active dynticks tasks
  rcu: Fix full dynticks' dependency on wide RCU nocb mode
  nohz: Protect smp_processor_id() in tick_nohz_task_switch()
  nohz_full: Add documentation.
  cputime_nsecs: use math64.h for nsec resolution conversion helpers
  nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
  nohz: Reduce overhead under high-freq idling patterns
  nohz: Remove full dynticks' superfluous dependency on RCU tree
  nohz: Fix unavailable tick_stop tracepoint in dynticks idle
  nohz: Add basic tracing
  nohz: Select wide RCU nocb for full dynticks
  nohz: Disable the tick when irq resume in full dynticks CPU
  nohz: Re-evaluate the tick for the new task after a context switch
  nohz: Prepare to stop the tick on irq exit
  nohz: Implement full dynticks kick
  nohz: Re-evaluate the tick from the scheduler IPI
  sched: New helper to prevent from stopping the tick in full dynticks
  sched: Kick full dynticks CPU that have more than one task enqueued.
  perf: New helper to prevent full dynticks CPUs from stopping tick
  perf: Kick full dynticks CPU if events rotation is needed
  ...
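For readers skimming the summary above, the core policy is easiest to see as a
condensed model of the veto chain the merge adds in kernel/time/tick-sched.c
(the real code is in the diff below). The following is an illustrative,
standalone userspace sketch only - the struct and main() harness are invented
for illustration and are not kernel code; the check names in the comments
mirror the helpers introduced by this merge.

    /* Illustrative userspace model of the full-dynticks "can we stop the
     * tick?" decision; any single "no" keeps the (at least 1 Hz) tick. */
    #include <stdbool.h>
    #include <stdio.h>

    struct cpu_state {
            int nr_running;              /* tasks on this CPU's runqueue */
            bool posix_cpu_timers_armed; /* pending per-task/process CPU timers */
            bool perf_rotation_needed;   /* perf events need context rotation */
            bool sched_clock_stable;     /* unstable sched_clock still needs ticks */
    };

    /* Mirrors the order of checks in can_stop_full_tick(): scheduler,
     * posix CPU timers, perf, then sched_clock stability. */
    static bool can_stop_full_tick(const struct cpu_state *cpu)
    {
            if (cpu->nr_running > 1)
                    return false;        /* sched_can_stop_tick() */
            if (cpu->posix_cpu_timers_armed)
                    return false;        /* posix_cpu_timers_can_stop_tick() */
            if (cpu->perf_rotation_needed)
                    return false;        /* perf_event_can_stop_tick() */
            if (!cpu->sched_clock_stable)
                    return false;        /* unstable sched clock needs the tick */
            return true;
    }

    int main(void)
    {
            struct cpu_state single_task = { 1, false, false, true };
            struct cpu_state two_tasks   = { 2, false, false, true };

            printf("single task: %s\n",
                   can_stop_full_tick(&single_task) ? "stop tick" : "keep tick");
            printf("two tasks:   %s\n",
                   can_stop_full_tick(&two_tasks) ? "stop tick" : "keep tick");
            return 0;
    }

Even when all checks pass, the scheduler still caps the deferment at one tick
per second (scheduler_tick_max_deferment() in the sched/core.c hunk below), so
load accounting and vruntime keep moving forward.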
Diffstat (limited to 'kernel')
-rw-r--r--  kernel/events/core.c           |  17
-rw-r--r--  kernel/hrtimer.c               |   4
-rw-r--r--  kernel/posix-cpu-timers.c      |  76
-rw-r--r--  kernel/rcutree.c               |  16
-rw-r--r--  kernel/rcutree.h               |   2
-rw-r--r--  kernel/rcutree_plugin.h        |  33
-rw-r--r--  kernel/sched/core.c            |  92
-rw-r--r--  kernel/sched/fair.c            |  10
-rw-r--r--  kernel/sched/idle_task.c       |   1
-rw-r--r--  kernel/sched/sched.h           |  25
-rw-r--r--  kernel/softirq.c               |  19
-rw-r--r--  kernel/time/Kconfig            |  80
-rw-r--r--  kernel/time/tick-broadcast.c   |   3
-rw-r--r--  kernel/time/tick-common.c      |   5
-rw-r--r--  kernel/time/tick-sched.c       | 296
-rw-r--r--  kernel/timer.c                 |  16
16 files changed, 607 insertions, 88 deletions
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3820e3cefbae..6b41c1899a8b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -18,6 +18,7 @@
 #include <linux/poll.h>
 #include <linux/slab.h>
 #include <linux/hash.h>
+#include <linux/tick.h>
 #include <linux/sysfs.h>
 #include <linux/dcache.h>
 #include <linux/percpu.h>
@@ -685,8 +686,12 @@ static void perf_pmu_rotate_start(struct pmu *pmu)
 
         WARN_ON(!irqs_disabled());
 
-        if (list_empty(&cpuctx->rotation_list))
+        if (list_empty(&cpuctx->rotation_list)) {
+                int was_empty = list_empty(head);
                 list_add(&cpuctx->rotation_list, head);
+                if (was_empty)
+                        tick_nohz_full_kick();
+        }
 }
 
 static void get_ctx(struct perf_event_context *ctx)
@@ -2591,6 +2596,16 @@ done:
         list_del_init(&cpuctx->rotation_list);
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+bool perf_event_can_stop_tick(void)
+{
+        if (list_empty(&__get_cpu_var(rotation_list)))
+                return true;
+        else
+                return false;
+}
+#endif
+
 void perf_event_task_tick(void)
 {
         struct list_head *head = &__get_cpu_var(rotation_list);
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 609d8ff38b74..fd4b13b131f8 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -172,7 +172,7 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
  */
 static int hrtimer_get_target(int this_cpu, int pinned)
 {
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
         if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
                 return get_nohz_timer_target();
 #endif
@@ -1125,7 +1125,7 @@ ktime_t hrtimer_get_remaining(const struct hrtimer *timer)
 }
 EXPORT_SYMBOL_GPL(hrtimer_get_remaining);
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /**
  * hrtimer_get_next_event - get the time until next expiry event
  *
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 8fd709c9bb58..42670e9b44e0 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -10,6 +10,8 @@
 #include <linux/kernel_stat.h>
 #include <trace/events/timer.h>
 #include <linux/random.h>
+#include <linux/tick.h>
+#include <linux/workqueue.h>
 
 /*
  * Called after updating RLIMIT_CPU to run cpu timer and update
@@ -153,6 +155,21 @@ static void bump_cpu_timer(struct k_itimer *timer,
         }
 }
 
+/**
+ * task_cputime_zero - Check a task_cputime struct for all zero fields.
+ *
+ * @cputime:    The struct to compare.
+ *
+ * Checks @cputime to see if all fields are zero.  Returns true if all fields
+ * are zero, false if any field is nonzero.
+ */
+static inline int task_cputime_zero(const struct task_cputime *cputime)
+{
+        if (!cputime->utime && !cputime->stime && !cputime->sum_exec_runtime)
+                return 1;
+        return 0;
+}
+
 static inline cputime_t prof_ticks(struct task_struct *p)
 {
         cputime_t utime, stime;
@@ -636,6 +653,37 @@ static int cpu_timer_sample_group(const clockid_t which_clock,
         return 0;
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+static void nohz_kick_work_fn(struct work_struct *work)
+{
+        tick_nohz_full_kick_all();
+}
+
+static DECLARE_WORK(nohz_kick_work, nohz_kick_work_fn);
+
+/*
+ * We need the IPIs to be sent from sane process context.
+ * The posix cpu timers are always set with irqs disabled.
+ */
+static void posix_cpu_timer_kick_nohz(void)
+{
+        schedule_work(&nohz_kick_work);
+}
+
+bool posix_cpu_timers_can_stop_tick(struct task_struct *tsk)
+{
+        if (!task_cputime_zero(&tsk->cputime_expires))
+                return false;
+
+        if (tsk->signal->cputimer.running)
+                return false;
+
+        return true;
+}
+#else
+static inline void posix_cpu_timer_kick_nohz(void) { }
+#endif
+
 /*
  * Guts of sys_timer_settime for CPU timers.
  * This is called with the timer locked and interrupts disabled.
@@ -794,6 +842,8 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int flags,
                 sample_to_timespec(timer->it_clock,
                                    old_incr, &old->it_interval);
         }
+        if (!ret)
+                posix_cpu_timer_kick_nohz();
         return ret;
 }
 
@@ -1008,21 +1058,6 @@ static void check_cpu_itimer(struct task_struct *tsk, struct cpu_itimer *it,
         }
 }
 
-/**
- * task_cputime_zero - Check a task_cputime struct for all zero fields.
- *
- * @cputime:    The struct to compare.
- *
- * Checks @cputime to see if all fields are zero.  Returns true if all fields
- * are zero, false if any field is nonzero.
- */
-static inline int task_cputime_zero(const struct task_cputime *cputime)
-{
-        if (!cputime->utime && !cputime->stime && !cputime->sum_exec_runtime)
-                return 1;
-        return 0;
-}
-
 /*
  * Check for any per-thread CPU timers that have fired and move them
  * off the tsk->*_timers list onto the firing list.  Per-thread timers
@@ -1336,6 +1371,13 @@ void run_posix_cpu_timers(struct task_struct *tsk)
                 cpu_timer_fire(timer);
                 spin_unlock(&timer->it_lock);
         }
+
+        /*
+         * In case some timers were rescheduled after the queue got emptied,
+         * wake up full dynticks CPUs.
+         */
+        if (tsk->signal->cputimer.running)
+                posix_cpu_timer_kick_nohz();
 }
 
 /*
@@ -1366,7 +1408,7 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
                 }
 
                 if (!*newval)
-                        return;
+                        goto out;
                 *newval += now.cpu;
         }
 
@@ -1384,6 +1426,8 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
                 tsk->signal->cputime_expires.virt_exp = *newval;
                 break;
         }
+out:
+        posix_cpu_timer_kick_nohz();
 }
 
 static int do_cpu_nanosleep(const clockid_t which_clock, int flags,
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d8534308fd05..16ea67925015 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -799,6 +799,16 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
                 rdp->offline_fqs++;
                 return 1;
         }
+
+        /*
+         * There is a possibility that a CPU in adaptive-ticks state
+         * might run in the kernel with the scheduling-clock tick disabled
+         * for an extended time period.  Invoke rcu_kick_nohz_cpu() to
+         * force the CPU to restart the scheduling-clock tick in this
+         * CPU is in this state.
+         */
+        rcu_kick_nohz_cpu(rdp->cpu);
+
         return 0;
 }
 
@@ -1820,7 +1830,7 @@ rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
                           struct rcu_node *rnp, struct rcu_data *rdp)
 {
         /* No-CBs CPUs do not have orphanable callbacks. */
-        if (is_nocb_cpu(rdp->cpu))
+        if (rcu_is_nocb_cpu(rdp->cpu))
                 return;
 
         /*
@@ -2892,10 +2902,10 @@ static void _rcu_barrier(struct rcu_state *rsp)
          * corresponding CPU's preceding callbacks have been invoked.
          */
         for_each_possible_cpu(cpu) {
-                if (!cpu_online(cpu) && !is_nocb_cpu(cpu))
+                if (!cpu_online(cpu) && !rcu_is_nocb_cpu(cpu))
                         continue;
                 rdp = per_cpu_ptr(rsp->rda, cpu);
-                if (is_nocb_cpu(cpu)) {
+                if (rcu_is_nocb_cpu(cpu)) {
                         _rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
                                            rsp->n_barrier_done);
                         atomic_inc(&rsp->barrier_cpu_count);
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 14ee40795d6f..da77a8f57ff9 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -530,13 +530,13 @@ static int rcu_nocb_needs_gp(struct rcu_state *rsp);
 static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq);
 static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
-static bool is_nocb_cpu(int cpu);
 static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
                             bool lazy);
 static bool rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
                                       struct rcu_data *rdp);
 static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
 static void rcu_spawn_nocb_kthreads(struct rcu_state *rsp);
+static void rcu_kick_nohz_cpu(int cpu);
 static bool init_nocb_callback_list(struct rcu_data *rdp);
 
 #endif /* #ifndef RCU_TREE_NONCORE */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index d084ae3f281c..170814dc418f 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -28,6 +28,7 @@
 #include <linux/gfp.h>
 #include <linux/oom.h>
 #include <linux/smpboot.h>
+#include <linux/tick.h>
 
 #define RCU_KTHREAD_PRIO 1
 
@@ -1705,7 +1706,7 @@ static void rcu_prepare_for_idle(int cpu)
                 return;
 
         /* If this is a no-CBs CPU, no callbacks, just return. */
-        if (is_nocb_cpu(cpu))
+        if (rcu_is_nocb_cpu(cpu))
                 return;
 
         /*
@@ -1747,7 +1748,7 @@ static void rcu_cleanup_after_idle(int cpu)
         struct rcu_data *rdp;
         struct rcu_state *rsp;
 
-        if (is_nocb_cpu(cpu))
+        if (rcu_is_nocb_cpu(cpu))
                 return;
         rcu_try_advance_all_cbs();
         for_each_rcu_flavor(rsp) {
@@ -2052,7 +2053,7 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 }
 
 /* Is the specified CPU a no-CPUs CPU? */
-static bool is_nocb_cpu(int cpu)
+bool rcu_is_nocb_cpu(int cpu)
 {
         if (have_rcu_nocb_mask)
                 return cpumask_test_cpu(cpu, rcu_nocb_mask);
@@ -2110,7 +2111,7 @@ static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
                             bool lazy)
 {
 
-        if (!is_nocb_cpu(rdp->cpu))
+        if (!rcu_is_nocb_cpu(rdp->cpu))
                 return 0;
         __call_rcu_nocb_enqueue(rdp, rhp, &rhp->next, 1, lazy);
         if (__is_kfree_rcu_offset((unsigned long)rhp->func))
@@ -2134,7 +2135,7 @@ static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
         long qll = rsp->qlen_lazy;
 
         /* If this is not a no-CBs CPU, tell the caller to do it the old way. */
-        if (!is_nocb_cpu(smp_processor_id()))
+        if (!rcu_is_nocb_cpu(smp_processor_id()))
                 return 0;
         rsp->qlen = 0;
         rsp->qlen_lazy = 0;
@@ -2306,11 +2307,6 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 {
 }
 
-static bool is_nocb_cpu(int cpu)
-{
-        return false;
-}
-
 static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
                             bool lazy)
 {
@@ -2337,3 +2333,20 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)
 }
 
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
+
+/*
+ * An adaptive-ticks CPU can potentially execute in kernel mode for an
+ * arbitrarily long period of time with the scheduling-clock tick turned
+ * off.  RCU will be paying attention to this CPU because it is in the
+ * kernel, but the CPU cannot be guaranteed to be executing the RCU state
+ * machine because the scheduling-clock tick has been disabled.  Therefore,
+ * if an adaptive-ticks CPU is failing to respond to the current grace
+ * period and has not be idle from an RCU perspective, kick it.
+ */
+static void rcu_kick_nohz_cpu(int cpu)
+{
+#ifdef CONFIG_NO_HZ_FULL
+        if (tick_nohz_full_cpu(cpu))
+                smp_send_reschedule(cpu);
+#endif /* #ifdef CONFIG_NO_HZ_FULL */
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5662f58f0b69..58453b8272fd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -544,7 +544,7 @@ void resched_cpu(int cpu)
         raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /*
  * In the semi idle case, use the nearest busy cpu for migrating timers
  * from an idle cpu. This is good for power-savings.
@@ -582,7 +582,7 @@ unlock:
  * account when the CPU goes back to idle and evaluates the timer
  * wheel for the next timer event.
  */
-void wake_up_idle_cpu(int cpu)
+static void wake_up_idle_cpu(int cpu)
 {
         struct rq *rq = cpu_rq(cpu);
 
@@ -612,20 +612,56 @@ void wake_up_idle_cpu(int cpu)
         smp_send_reschedule(cpu);
 }
 
+static bool wake_up_full_nohz_cpu(int cpu)
+{
+        if (tick_nohz_full_cpu(cpu)) {
+                if (cpu != smp_processor_id() ||
+                    tick_nohz_tick_stopped())
+                        smp_send_reschedule(cpu);
+                return true;
+        }
+
+        return false;
+}
+
+void wake_up_nohz_cpu(int cpu)
+{
+        if (!wake_up_full_nohz_cpu(cpu))
+                wake_up_idle_cpu(cpu);
+}
+
 static inline bool got_nohz_idle_kick(void)
 {
         int cpu = smp_processor_id();
         return idle_cpu(cpu) && test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
 }
 
-#else /* CONFIG_NO_HZ */
+#else /* CONFIG_NO_HZ_COMMON */
 
 static inline bool got_nohz_idle_kick(void)
 {
         return false;
 }
 
-#endif /* CONFIG_NO_HZ */
+#endif /* CONFIG_NO_HZ_COMMON */
+
+#ifdef CONFIG_NO_HZ_FULL
+bool sched_can_stop_tick(void)
+{
+        struct rq *rq;
+
+        rq = this_rq();
+
+        /* Make sure rq->nr_running update is visible after the IPI */
+        smp_rmb();
+
+        /* More than one running task need preemption */
+        if (rq->nr_running > 1)
+                return false;
+
+        return true;
+}
+#endif /* CONFIG_NO_HZ_FULL */
 
 void sched_avg_update(struct rq *rq)
 {
@@ -1357,7 +1393,8 @@ static void sched_ttwu_pending(void)
 
 void scheduler_ipi(void)
 {
-        if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
+        if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick()
+                        && !tick_nohz_full_cpu(smp_processor_id()))
                 return;
 
         /*
@@ -1374,6 +1411,7 @@ void scheduler_ipi(void)
          * somewhat pessimize the simple resched case.
          */
         irq_enter();
+        tick_nohz_full_check();
         sched_ttwu_pending();
 
         /*
@@ -1855,6 +1893,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
                 kprobe_flush_task(prev);
                 put_task_struct(prev);
         }
+
+        tick_nohz_task_switch(current);
 }
 
 #ifdef CONFIG_SMP
@@ -2118,7 +2158,7 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
         return load >> FSHIFT;
 }
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /*
  * Handle NO_HZ for the global load-average.
  *
@@ -2344,12 +2384,12 @@ static void calc_global_nohz(void)
         smp_wmb();
         calc_load_idx++;
 }
-#else /* !CONFIG_NO_HZ */
+#else /* !CONFIG_NO_HZ_COMMON */
 
 static inline long calc_load_fold_idle(void) { return 0; }
 static inline void calc_global_nohz(void) { }
 
-#endif /* CONFIG_NO_HZ */
+#endif /* CONFIG_NO_HZ_COMMON */
 
 /*
  * calc_load - update the avenrun load estimates 10 ticks after the
@@ -2509,7 +2549,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
         sched_avg_update(this_rq);
 }
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /*
  * There is no sane way to deal with nohz on smp when using jiffies because the
  * cpu doing the jiffies update might drift wrt the cpu doing the jiffy reading
@@ -2569,7 +2609,7 @@ void update_cpu_load_nohz(void)
         }
         raw_spin_unlock(&this_rq->lock);
 }
-#endif /* CONFIG_NO_HZ */
+#endif /* CONFIG_NO_HZ_COMMON */
 
 /*
  * Called from scheduler_tick()
@@ -2696,7 +2736,34 @@ void scheduler_tick(void)
         rq->idle_balance = idle_cpu(cpu);
         trigger_load_balance(rq, cpu);
 #endif
+        rq_last_tick_reset(rq);
+}
+
+#ifdef CONFIG_NO_HZ_FULL
+/**
+ * scheduler_tick_max_deferment
+ *
+ * Keep at least one tick per second when a single
+ * active task is running because the scheduler doesn't
+ * yet completely support full dynticks environment.
+ *
+ * This makes sure that uptime, CFS vruntime, load
+ * balancing, etc... continue to move forward, even
+ * with a very low granularity.
+ */
+u64 scheduler_tick_max_deferment(void)
+{
+        struct rq *rq = this_rq();
+        unsigned long next, now = ACCESS_ONCE(jiffies);
+
+        next = rq->last_sched_tick + HZ;
+
+        if (time_before_eq(next, now))
+                return 0;
+
+        return jiffies_to_usecs(next - now) * NSEC_PER_USEC;
 }
+#endif
 
 notrace unsigned long get_parent_ip(unsigned long addr)
 {
@@ -6951,9 +7018,12 @@ void __init sched_init(void)
                 INIT_LIST_HEAD(&rq->cfs_tasks);
 
                 rq_attach_root(rq, &def_root_domain);
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
                 rq->nohz_flags = 0;
 #endif
+#ifdef CONFIG_NO_HZ_FULL
+                rq->last_sched_tick = 0;
+#endif
 #endif
                 init_rq_hrtick(rq);
                 atomic_set(&rq->nr_iowait, 0);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8bf7081b1ec5..c61a614465c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5355,7 +5355,7 @@ out_unlock:
         return 0;
 }
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /*
  * idle load balancing details
  * - When one of the busy CPUs notice that there may be an idle rebalancing
@@ -5572,9 +5572,9 @@ out:
         rq->next_balance = next_balance;
 }
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /*
- * In CONFIG_NO_HZ case, the idle balance kickee will do the
+ * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
  * rebalancing for all the cpus for whom scheduler ticks are stopped.
  */
 static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
@@ -5717,7 +5717,7 @@ void trigger_load_balance(struct rq *rq, int cpu)
         if (time_after_eq(jiffies, rq->next_balance) &&
             likely(!on_null_domain(cpu)))
                 raise_softirq(SCHED_SOFTIRQ);
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
         if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
                 nohz_balancer_kick(cpu);
 #endif
@@ -6187,7 +6187,7 @@ __init void init_sched_fair_class(void)
 #ifdef CONFIG_SMP
         open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
         nohz.next_balance = jiffies;
         zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
         cpu_notifier(sched_ilb_notifier, 0);
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index b8ce77328341..d8da01008d39 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -17,6 +17,7 @@ select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
 static void pre_schedule_idle(struct rq *rq, struct task_struct *prev)
 {
         idle_exit_fair(rq);
+        rq_last_tick_reset(rq);
 }
 
 static void post_schedule_idle(struct rq *rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c225c4c7111..ce39224d6155 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -5,6 +5,7 @@
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
+#include <linux/tick.h>
 
 #include "cpupri.h"
 #include "cpuacct.h"
@@ -405,10 +406,13 @@ struct rq {
         #define CPU_LOAD_IDX_MAX 5
         unsigned long cpu_load[CPU_LOAD_IDX_MAX];
         unsigned long last_load_update_tick;
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
         u64 nohz_stamp;
         unsigned long nohz_flags;
 #endif
+#ifdef CONFIG_NO_HZ_FULL
+        unsigned long last_sched_tick;
+#endif
         int skip_clock_update;
 
         /* capture load from *all* tasks on this cpu: */
@@ -1072,6 +1076,16 @@ static inline u64 steal_ticks(u64 steal)
 static inline void inc_nr_running(struct rq *rq)
 {
         rq->nr_running++;
+
+#ifdef CONFIG_NO_HZ_FULL
+        if (rq->nr_running == 2) {
+                if (tick_nohz_full_cpu(rq->cpu)) {
+                        /* Order rq->nr_running write against the IPI */
+                        smp_wmb();
+                        smp_send_reschedule(rq->cpu);
+                }
+        }
+#endif
 }
 
 static inline void dec_nr_running(struct rq *rq)
@@ -1079,6 +1093,13 @@ static inline void dec_nr_running(struct rq *rq)
         rq->nr_running--;
 }
 
+static inline void rq_last_tick_reset(struct rq *rq)
+{
+#ifdef CONFIG_NO_HZ_FULL
+        rq->last_sched_tick = jiffies;
+#endif
+}
+
 extern void update_rq_clock(struct rq *rq);
 
 extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
@@ -1299,7 +1320,7 @@ extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);
 
 extern void account_cfs_bandwidth_used(int enabled, int was_enabled);
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 enum rq_nohz_flag_bits {
         NOHZ_TICK_STOPPED,
         NOHZ_BALANCE_KICK,
diff --git a/kernel/softirq.c b/kernel/softirq.c
index aa82723c7202..b5197dcb0dad 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -329,6 +329,19 @@ static inline void invoke_softirq(void)
                 wakeup_softirqd();
 }
 
+static inline void tick_irq_exit(void)
+{
+#ifdef CONFIG_NO_HZ_COMMON
+        int cpu = smp_processor_id();
+
+        /* Make sure that timer wheel updates are propagated */
+        if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) {
+                if (!in_interrupt())
+                        tick_nohz_irq_exit();
+        }
+#endif
+}
+
 /*
  * Exit an interrupt context. Process softirqs if needed and possible:
  */
@@ -346,11 +359,7 @@ void irq_exit(void)
         if (!in_interrupt() && local_softirq_pending())
                 invoke_softirq();
 
-#ifdef CONFIG_NO_HZ
-        /* Make sure that timer wheel updates are propagated */
-        if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
-                tick_nohz_irq_exit();
-#endif
+        tick_irq_exit();
         rcu_irq_exit();
 }
 
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 24510d84efd7..e4c07b0692bb 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -64,20 +64,88 @@ config GENERIC_CMOS_UPDATE
 if GENERIC_CLOCKEVENTS
 menu "Timers subsystem"
 
-# Core internal switch. Selected by NO_HZ / HIGH_RES_TIMERS. This is
+# Core internal switch. Selected by NO_HZ_COMMON / HIGH_RES_TIMERS. This is
 # only related to the tick functionality. Oneshot clockevent devices
 # are supported independ of this.
 config TICK_ONESHOT
         bool
 
-config NO_HZ
-        bool "Tickless System (Dynamic Ticks)"
+config NO_HZ_COMMON
+        bool
         depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
         select TICK_ONESHOT
+
+choice
+        prompt "Timer tick handling"
+        default NO_HZ_IDLE if NO_HZ
+
+config HZ_PERIODIC
+        bool "Periodic timer ticks (constant rate, no dynticks)"
+        help
+          This option keeps the tick running periodically at a constant
+          rate, even when the CPU doesn't need it.
+
+config NO_HZ_IDLE
+        bool "Idle dynticks system (tickless idle)"
+        depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
+        select NO_HZ_COMMON
+        help
+          This option enables a tickless idle system: timer interrupts
+          will only trigger on an as-needed basis when the system is idle.
+          This is usually interesting for energy saving.
+
+          Most of the time you want to say Y here.
+
+config NO_HZ_FULL
+        bool "Full dynticks system (tickless)"
+        # NO_HZ_COMMON dependency
+        depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
+        # We need at least one periodic CPU for timekeeping
+        depends on SMP
+        # RCU_USER_QS dependency
+        depends on HAVE_CONTEXT_TRACKING
+        # VIRT_CPU_ACCOUNTING_GEN dependency
+        depends on 64BIT
+        select NO_HZ_COMMON
+        select RCU_USER_QS
+        select RCU_NOCB_CPU
+        select VIRT_CPU_ACCOUNTING_GEN
+        select CONTEXT_TRACKING_FORCE
+        select IRQ_WORK
+        help
+          Adaptively try to shutdown the tick whenever possible, even when
+          the CPU is running tasks. Typically this requires running a single
+          task on the CPU. Chances for running tickless are maximized when
+          the task mostly runs in userspace and has few kernel activity.
+
+          You need to fill up the nohz_full boot parameter with the
+          desired range of dynticks CPUs.
+
+          This is implemented at the expense of some overhead in user <-> kernel
+          transitions: syscalls, exceptions and interrupts. Even when it's
+          dynamically off.
+
+          Say N.
+
+endchoice
+
+config NO_HZ_FULL_ALL
+        bool "Full dynticks system on all CPUs by default"
+        depends on NO_HZ_FULL
+        help
+          If the user doesn't pass the nohz_full boot option to
+          define the range of full dynticks CPUs, consider that all
+          CPUs in the system are full dynticks by default.
+          Note the boot CPU will still be kept outside the range to
+          handle the timekeeping duty.
+
+config NO_HZ
+        bool "Old Idle dynticks config"
+        depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
         help
-          This option enables a tickless system: timer interrupts will
-          only trigger on an as-needed basis both when the system is
-          busy and when the system is idle.
+          This is the old config entry that enables dynticks idle.
+          We keep it around for a little while to enforce backward
+          compatibility with older config files.
 
 config HIGH_RES_TIMERS
         bool "High Resolution Timer Support"
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 61d00a8cdf2f..206bbfb34e09 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -693,7 +693,8 @@ void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
         bc->event_handler = tick_handle_oneshot_broadcast;
 
         /* Take the do_timer update */
-        tick_do_timer_cpu = cpu;
+        if (!tick_nohz_full_cpu(cpu))
+                tick_do_timer_cpu = cpu;
 
         /*
          * We must be careful here. There might be other CPUs
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 6176a3e45709..5d3fb100bc06 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -163,7 +163,10 @@ static void tick_setup_device(struct tick_device *td,
          * this cpu:
          */
         if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) {
-                tick_do_timer_cpu = cpu;
+                if (!tick_nohz_full_cpu(cpu))
+                        tick_do_timer_cpu = cpu;
+                else
+                        tick_do_timer_cpu = TICK_DO_TIMER_NONE;
                 tick_next_period = ktime_get();
                 tick_period = ktime_set(0, NSEC_PER_SEC / HZ);
         }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 225f8bf19095..bc67d4245e1d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -21,11 +21,15 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 #include <linux/irq_work.h>
+#include <linux/posix-timers.h>
+#include <linux/perf_event.h>
 
 #include <asm/irq_regs.h>
 
 #include "tick-internal.h"
 
+#include <trace/events/timer.h>
+
 /*
  * Per cpu nohz control structure
  */
@@ -104,7 +108,7 @@ static void tick_sched_do_timer(ktime_t now)
 {
         int cpu = smp_processor_id();
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
         /*
          * Check if the do_timer duty was dropped. We don't care about
          * concurrency: This happens only when the cpu in charge went
@@ -112,7 +116,8 @@ static void tick_sched_do_timer(ktime_t now)
          * this duty, then the jiffies update is still serialized by
          * jiffies_lock.
          */
-        if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE))
+        if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)
+            && !tick_nohz_full_cpu(cpu))
                 tick_do_timer_cpu = cpu;
 #endif
 
@@ -123,7 +128,7 @@ static void tick_sched_do_timer(ktime_t now)
 
 static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
 {
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
         /*
          * When we are idle and the tick is stopped, we have to touch
         * the watchdog as we might not schedule for a really long
@@ -142,10 +147,226 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
         profile_tick(CPU_PROFILING);
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+static cpumask_var_t nohz_full_mask;
+bool have_nohz_full_mask;
+
+static bool can_stop_full_tick(void)
+{
+        WARN_ON_ONCE(!irqs_disabled());
+
+        if (!sched_can_stop_tick()) {
+                trace_tick_stop(0, "more than 1 task in runqueue\n");
+                return false;
+        }
+
+        if (!posix_cpu_timers_can_stop_tick(current)) {
+                trace_tick_stop(0, "posix timers running\n");
+                return false;
+        }
+
+        if (!perf_event_can_stop_tick()) {
+                trace_tick_stop(0, "perf events running\n");
+                return false;
+        }
+
+        /* sched_clock_tick() needs us? */
+#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
+        /*
+         * TODO: kick full dynticks CPUs when
+         * sched_clock_stable is set.
+         */
+        if (!sched_clock_stable) {
+                trace_tick_stop(0, "unstable sched clock\n");
+                return false;
+        }
+#endif
+
+        return true;
+}
+
+static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now);
+
+/*
+ * Re-evaluate the need for the tick on the current CPU
+ * and restart it if necessary.
+ */
+void tick_nohz_full_check(void)
+{
+        struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+        if (tick_nohz_full_cpu(smp_processor_id())) {
+                if (ts->tick_stopped && !is_idle_task(current)) {
+                        if (!can_stop_full_tick())
+                                tick_nohz_restart_sched_tick(ts, ktime_get());
+                }
+        }
+}
+
+static void nohz_full_kick_work_func(struct irq_work *work)
+{
+        tick_nohz_full_check();
+}
+
+static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) = {
+        .func = nohz_full_kick_work_func,
+};
+
+/*
+ * Kick the current CPU if it's full dynticks in order to force it to
+ * re-evaluate its dependency on the tick and restart it if necessary.
+ */
+void tick_nohz_full_kick(void)
+{
+        if (tick_nohz_full_cpu(smp_processor_id()))
+                irq_work_queue(&__get_cpu_var(nohz_full_kick_work));
+}
+
+static void nohz_full_kick_ipi(void *info)
+{
+        tick_nohz_full_check();
+}
+
+/*
+ * Kick all full dynticks CPUs in order to force these to re-evaluate
+ * their dependency on the tick and restart it if necessary.
+ */
+void tick_nohz_full_kick_all(void)
+{
+        if (!have_nohz_full_mask)
+                return;
+
+        preempt_disable();
+        smp_call_function_many(nohz_full_mask,
+                               nohz_full_kick_ipi, NULL, false);
+        preempt_enable();
+}
+
+/*
+ * Re-evaluate the need for the tick as we switch the current task.
+ * It might need the tick due to per task/process properties:
+ * perf events, posix cpu timers, ...
+ */
+void tick_nohz_task_switch(struct task_struct *tsk)
+{
+        unsigned long flags;
+
+        local_irq_save(flags);
+
+        if (!tick_nohz_full_cpu(smp_processor_id()))
+                goto out;
+
+        if (tick_nohz_tick_stopped() && !can_stop_full_tick())
+                tick_nohz_full_kick();
+
+out:
+        local_irq_restore(flags);
+}
+
+int tick_nohz_full_cpu(int cpu)
+{
+        if (!have_nohz_full_mask)
+                return 0;
+
+        return cpumask_test_cpu(cpu, nohz_full_mask);
+}
+
+/* Parse the boot-time nohz CPU list from the kernel parameters. */
+static int __init tick_nohz_full_setup(char *str)
+{
+        int cpu;
+
+        alloc_bootmem_cpumask_var(&nohz_full_mask);
+        if (cpulist_parse(str, nohz_full_mask) < 0) {
+                pr_warning("NOHZ: Incorrect nohz_full cpumask\n");
+                return 1;
+        }
+
+        cpu = smp_processor_id();
+        if (cpumask_test_cpu(cpu, nohz_full_mask)) {
+                pr_warning("NO_HZ: Clearing %d from nohz_full range for timekeeping\n", cpu);
+                cpumask_clear_cpu(cpu, nohz_full_mask);
+        }
+        have_nohz_full_mask = true;
+
+        return 1;
+}
+__setup("nohz_full=", tick_nohz_full_setup);
+
+static int __cpuinit tick_nohz_cpu_down_callback(struct notifier_block *nfb,
+                                                 unsigned long action,
+                                                 void *hcpu)
+{
+        unsigned int cpu = (unsigned long)hcpu;
+
+        switch (action & ~CPU_TASKS_FROZEN) {
+        case CPU_DOWN_PREPARE:
+                /*
+                 * If we handle the timekeeping duty for full dynticks CPUs,
+                 * we can't safely shutdown that CPU.
+                 */
+                if (have_nohz_full_mask && tick_do_timer_cpu == cpu)
+                        return -EINVAL;
+                break;
+        }
+        return NOTIFY_OK;
+}
+
+/*
+ * Worst case string length in chunks of CPU range seems 2 steps
+ * separations: 0,2,4,6,...
+ * This is NR_CPUS + sizeof('\0')
+ */
+static char __initdata nohz_full_buf[NR_CPUS + 1];
+
+static int tick_nohz_init_all(void)
+{
+        int err = -1;
+
+#ifdef CONFIG_NO_HZ_FULL_ALL
+        if (!alloc_cpumask_var(&nohz_full_mask, GFP_KERNEL)) {
+                pr_err("NO_HZ: Can't allocate full dynticks cpumask\n");
+                return err;
+        }
+        err = 0;
+        cpumask_setall(nohz_full_mask);
+        cpumask_clear_cpu(smp_processor_id(), nohz_full_mask);
+        have_nohz_full_mask = true;
+#endif
+        return err;
+}
+
+void __init tick_nohz_init(void)
+{
+        int cpu;
+
+        if (!have_nohz_full_mask) {
+                if (tick_nohz_init_all() < 0)
+                        return;
+        }
+
+        cpu_notifier(tick_nohz_cpu_down_callback, 0);
+
+        /* Make sure full dynticks CPU are also RCU nocbs */
+        for_each_cpu(cpu, nohz_full_mask) {
+                if (!rcu_is_nocb_cpu(cpu)) {
+                        pr_warning("NO_HZ: CPU %d is not RCU nocb: "
+                                   "cleared from nohz_full range", cpu);
+                        cpumask_clear_cpu(cpu, nohz_full_mask);
+                }
+        }
+
+        cpulist_scnprintf(nohz_full_buf, sizeof(nohz_full_buf), nohz_full_mask);
+        pr_info("NO_HZ: Full dynticks CPUs: %s.\n", nohz_full_buf);
+}
+#else
+#define have_nohz_full_mask (0)
+#endif
+
 /*
  * NOHZ - aka dynamic tick functionality
  */
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /*
  * NO HZ enabled ?
  */
@@ -345,11 +566,12 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
                         delta_jiffies = rcu_delta_jiffies;
                 }
         }
+
         /*
-         * Do not stop the tick, if we are only one off
-         * or if the cpu is required for rcu
+         * Do not stop the tick, if we are only one off (or less)
+         * or if the cpu is required for RCU:
          */
-        if (!ts->tick_stopped && delta_jiffies == 1)
+        if (!ts->tick_stopped && delta_jiffies <= 1)
                 goto out;
 
         /* Schedule the tick, if we are at least one jiffie off */
@@ -378,6 +600,13 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
                 time_delta = KTIME_MAX;
         }
 
+#ifdef CONFIG_NO_HZ_FULL
+        if (!ts->inidle) {
+                time_delta = min(time_delta,
+                                 scheduler_tick_max_deferment());
+        }
+#endif
+
         /*
          * calculate the expiry time for the next timer wheel
          * timer. delta_jiffies >= NEXT_TIMER_MAX_DELTA signals
@@ -421,6 +650,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 
                 ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
                 ts->tick_stopped = 1;
+                trace_tick_stop(1, " ");
         }
 
         /*
@@ -457,6 +687,24 @@ out:
         return ret;
 }
 
+static void tick_nohz_full_stop_tick(struct tick_sched *ts)
+{
+#ifdef CONFIG_NO_HZ_FULL
+        int cpu = smp_processor_id();
+
+        if (!tick_nohz_full_cpu(cpu) || is_idle_task(current))
+                return;
+
+        if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
+                return;
+
+        if (!can_stop_full_tick())
+                return;
+
+        tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+#endif
+}
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
         /*
@@ -489,6 +737,21 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
                 return false;
         }
 
+        if (have_nohz_full_mask) {
+                /*
+                 * Keep the tick alive to guarantee timekeeping progression
+                 * if there are full dynticks CPUs around
+                 */
+                if (tick_do_timer_cpu == cpu)
+                        return false;
+                /*
+                 * Boot safety: make sure the timekeeping duty has been
+                 * assigned before entering dyntick-idle mode,
+                 */
+                if (tick_do_timer_cpu == TICK_DO_TIMER_NONE)
+                        return false;
+        }
+
         return true;
 }
 
@@ -568,12 +831,13 @@ void tick_nohz_irq_exit(void)
 {
         struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 
-        if (!ts->inidle)
-                return;
-
-        /* Cancel the timer because CPU already waken up from the C-states*/
-        menu_hrtimer_cancel();
-        __tick_nohz_idle_enter(ts);
+        if (ts->inidle) {
+                /* Cancel the timer because CPU already waken up from the C-states*/
+                menu_hrtimer_cancel();
+                __tick_nohz_idle_enter(ts);
+        } else {
+                tick_nohz_full_stop_tick(ts);
+        }
 }
 
 /**
@@ -802,7 +1066,7 @@ static inline void tick_check_nohz(int cpu)
 static inline void tick_nohz_switch_to_nohz(void) { }
 static inline void tick_check_nohz(int cpu) { }
 
-#endif /* NO_HZ */
+#endif /* CONFIG_NO_HZ_COMMON */
 
 /*
  * Called from irq_enter to notify about the possible interruption of idle()
@@ -887,14 +1151,14 @@ void tick_setup_sched_timer(void)
                 now = ktime_get();
         }
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
         if (tick_nohz_enabled)
                 ts->nohz_mode = NOHZ_MODE_HIGHRES;
 #endif
 }
 #endif /* HIGH_RES_TIMERS */
 
-#if defined CONFIG_NO_HZ || defined CONFIG_HIGH_RES_TIMERS
+#if defined CONFIG_NO_HZ_COMMON || defined CONFIG_HIGH_RES_TIMERS
 void tick_cancel_sched_timer(int cpu)
 {
         struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
diff --git a/kernel/timer.c b/kernel/timer.c
index 09bca8ce9771..a860bba34412 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -739,7 +739,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 
         cpu = smp_processor_id();
 
-#if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
+#if defined(CONFIG_NO_HZ_COMMON) && defined(CONFIG_SMP)
         if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu))
                 cpu = get_nohz_timer_target();
 #endif
@@ -931,14 +931,14 @@ void add_timer_on(struct timer_list *timer, int cpu)
         debug_activate(timer, timer->expires);
         internal_add_timer(base, timer);
         /*
-         * Check whether the other CPU is idle and needs to be
-         * triggered to reevaluate the timer wheel when nohz is
-         * active. We are protected against the other CPU fiddling
+         * Check whether the other CPU is in dynticks mode and needs
+         * to be triggered to reevaluate the timer wheel.
+         * We are protected against the other CPU fiddling
          * with the timer by holding the timer base lock. This also
-         * makes sure that a CPU on the way to idle can not evaluate
-         * the timer wheel.
+         * makes sure that a CPU on the way to stop its tick can not
+         * evaluate the timer wheel.
          */
-        wake_up_idle_cpu(cpu);
+        wake_up_nohz_cpu(cpu);
         spin_unlock_irqrestore(&base->lock, flags);
 }
 EXPORT_SYMBOL_GPL(add_timer_on);
@@ -1189,7 +1189,7 @@ static inline void __run_timers(struct tvec_base *base)
         spin_unlock_irq(&base->lock);
 }
 
-#ifdef CONFIG_NO_HZ
+#ifdef CONFIG_NO_HZ_COMMON
 /*
  * Find out when the next timer event is due to happen. This
  * is used on S/390 to stop all activity when a CPU is idle.