author     Linus Torvalds <torvalds@woody.linux-foundation.org>   2007-08-09 11:23:31 -0400
committer  Linus Torvalds <torvalds@woody.linux-foundation.org>   2007-08-09 11:23:31 -0400
commit     be12014dd7750648fde33e1e45cac24dc9a8be6d (patch)
tree       caf9715d2c37f3c08c25e51c9f416c2bbf956236
parent     e3bcf5e2785aa49f75f36a8d27d601891a7ff12b (diff)
parent     7cff8cf61cac15fa29a1ca802826d2bcbca66152 (diff)
Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: (61 commits)
  sched: refine negative nice level granularity
  sched: fix update_stats_enqueue() reniced codepath
  sched: round a bit better
  sched: make the multiplication table more accurate
  sched: optimize update_rq_clock() calls in the load-balancer
  sched: optimize activate_task()
  sched: clean up set_curr_task_fair()
  sched: remove __update_rq_clock() call from entity_tick()
  sched: move the __update_rq_clock() call to scheduler_tick()
  sched debug: remove the 'u64 now' parameter from print_task()/_rq()
  sched: remove the 'u64 now' local variables
  sched: remove the 'u64 now' parameter from deactivate_task()
  sched: remove the 'u64 now' parameter from dequeue_task()
  sched: remove the 'u64 now' parameter from enqueue_task()
  sched: remove the 'u64 now' parameter from dec_nr_running()
  sched: remove the 'u64 now' parameter from inc_nr_running()
  sched: remove the 'u64 now' parameter from dec_load()
  sched: remove the 'u64 now' parameter from inc_load()
  sched: remove the 'u64 now' parameter from update_curr_load()
  sched: remove the 'u64 now' parameter from ->task_new()
  ...
-rw-r--r--  Documentation/sched-design-CFS.txt   |   2
-rw-r--r--  Documentation/sched-nice-design.txt  | 108
-rw-r--r--  include/linux/sched.h                |  20
-rw-r--r--  kernel/sched.c                       | 339
-rw-r--r--  kernel/sched_debug.c                 |  16
-rw-r--r--  kernel/sched_fair.c                  | 212
-rw-r--r--  kernel/sched_idletask.c              |  10
-rw-r--r--  kernel/sched_rt.c                    |  48
8 files changed, 421 insertions(+), 334 deletions(-)
diff --git a/Documentation/sched-design-CFS.txt b/Documentation/sched-design-CFS.txt
index 16feebb7bdc0..84901e7c0508 100644
--- a/Documentation/sched-design-CFS.txt
+++ b/Documentation/sched-design-CFS.txt
@@ -83,7 +83,7 @@ Some implementation details:
83 CFS uses nanosecond granularity accounting and does not rely on any 83 CFS uses nanosecond granularity accounting and does not rely on any
84 jiffies or other HZ detail. Thus the CFS scheduler has no notion of 84 jiffies or other HZ detail. Thus the CFS scheduler has no notion of
85 'timeslices' and has no heuristics whatsoever. There is only one 85 'timeslices' and has no heuristics whatsoever. There is only one
86 central tunable: 86 central tunable (you have to switch on CONFIG_SCHED_DEBUG):
87 87
88 /proc/sys/kernel/sched_granularity_ns 88 /proc/sys/kernel/sched_granularity_ns
89 89
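Side note on the documentation tweak above: the tunable only exists when the kernel is built with CONFIG_SCHED_DEBUG, so anything reading it should tolerate the file being absent. A small user-space sketch (illustration only, not part of this patch):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/kernel/sched_granularity_ns", "r");
        unsigned long long ns;

        if (!f) {
            /* likely a kernel built without CONFIG_SCHED_DEBUG */
            perror("sched_granularity_ns");
            return 1;
        }
        if (fscanf(f, "%llu", &ns) == 1)
            printf("sched_granularity_ns = %llu ns\n", ns);
        fclose(f);
        return 0;
    }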
diff --git a/Documentation/sched-nice-design.txt b/Documentation/sched-nice-design.txt
new file mode 100644
index 000000000000..e2bae5a577e3
--- /dev/null
+++ b/Documentation/sched-nice-design.txt
@@ -0,0 +1,108 @@
1This document explains the thinking about the revamped and streamlined
2nice-levels implementation in the new Linux scheduler.
3
4Nice levels were always pretty weak under Linux and people continuously
5pestered us to make nice +19 tasks use up much less CPU time.
6
7Unfortunately that was not that easy to implement under the old
8scheduler, (otherwise we'd have done it long ago) because nice level
9support was historically coupled to timeslice length, and timeslice
10units were driven by the HZ tick, so the smallest timeslice was 1/HZ.
11
12In the O(1) scheduler (in 2003) we changed negative nice levels to be
13much stronger than they were before in 2.4 (and people were happy about
14that change), and we also intentionally calibrated the linear timeslice
15rule so that nice +19 level would be _exactly_ 1 jiffy. To better
16understand it, the timeslice graph went like this (cheesy ASCII art
17alert!):
18
19
20 A
21 \ | [timeslice length]
22 \ |
23 \ |
24 \ |
25 \ |
26 \|___100msecs
27 |^ . _
28 | ^ . _
29 | ^ . _
30 -*----------------------------------*-----> [nice level]
31 -20 | +19
32 |
33 |
34
35So that if someone wanted to really renice tasks, +19 would give a much
36bigger hit than the normal linear rule would do. (The solution of
37changing the ABI to extend priorities was discarded early on.)
38
39This approach worked to some degree for some time, but later on with
40HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
41we felt to be a bit excessive. Excessive _not_ because it's too small of
42a CPU utilization, but because it causes too frequent (once per
43millisec) rescheduling. (and would thus trash the cache, etc. Remember,
44this was long ago when hardware was weaker and caches were smaller, and
45people were running number crunching apps at nice +19.)
46
47So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the
48right minimal granularity - and this translates to 5% CPU utilization.
49But the fundamental HZ-sensitive property for nice+19 still remained,
50and we never got a single complaint about nice +19 being too _weak_ in
51terms of CPU utilization, we only got complaints about it (still) being
52too _strong_ :-)
53
54To sum it up: we always wanted to make nice levels more consistent, but
55within the constraints of HZ and jiffies and their nasty design level
56coupling to timeslices and granularity it was not really viable.
57
58The second (less frequent but still periodically occurring) complaint
59about Linux's nice level support was its asymmetry around the origin
60(which you can see demonstrated in the picture above), or more
61accurately: the fact that nice level behavior depended on the _absolute_
62nice level as well, while the nice API itself is fundamentally
63"relative":
64
65 int nice(int inc);
66
67 asmlinkage long sys_nice(int increment)
68
69(the first one is the glibc API, the second one is the syscall API.)
70Note that the 'inc' is relative to the current nice level. Tools like
71bash's "nice" command mirror this relative API.
72
73With the old scheduler, if you for example started a niced task with +1
74and another task with +2, the CPU split between the two tasks would
75depend on the nice level of the parent shell - if it was at nice -10 the
76CPU split was different than if it was at +5 or +10.
77
78A third complaint against Linux's nice level support was that negative
79nice levels were not 'punchy enough', so lots of people had to resort to
80running audio (and other multimedia) apps under RT priorities such as
81SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
82proof, and a buggy SCHED_FIFO app can also lock up the system for good.
83
84The new scheduler in v2.6.23 addresses all three types of complaints:
85
86To address the first complaint (of nice levels being not "punchy"
87enough), the scheduler was decoupled from 'time slice' and HZ concepts
88(and granularity was made a separate concept from nice levels) and thus
89it was possible to implement better and more consistent nice +19
90support: with the new scheduler nice +19 tasks get a HZ-independent
911.5%, instead of the variable 3%-5%-9% range they got in the old
92scheduler.
93
94To address the second complaint (of nice levels not being consistent),
95the new scheduler makes nice(1) have the same CPU utilization effect on
96tasks, regardless of their absolute nice levels. So on the new
97scheduler, running a nice +10 and a nice 11 task has the same CPU
98utilization "split" between them as running a nice -5 and a nice -4
99task. (one will get 55% of the CPU, the other 45%.) That is why nice
100levels were changed to be "multiplicative" (or exponential) - that way
101it does not matter which nice level you start out from, the 'relative
102result' will always be the same.
103
104The third complaint (of negative nice levels not being "punchy" enough
105and forcing audio apps to run under the more dangerous SCHED_FIFO
106scheduling policy) is addressed by the new scheduler almost
107automatically: stronger negative nice levels are an automatic
108side-effect of the recalibrated dynamic range of nice levels.
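The ~55%/45% split described in the new document falls straight out of the geometric prio_to_weight[] table that this merge installs in kernel/sched.c (shown further down in this diff). A user-space sketch (illustration only, not part of the patch) that recomputes the split for the two pairs of adjacent nice levels mentioned above:

    #include <stdio.h>

    /* weights copied from the prio_to_weight[] table added by this merge */
    static const int prio_to_weight[40] = {
        /* -20 */ 88761, 71755, 56483, 46273, 36291,
        /* -15 */ 29154, 23254, 18705, 14949, 11916,
        /* -10 */  9548,  7620,  6100,  4904,  3906,
        /*  -5 */  3121,  2501,  1991,  1586,  1277,
        /*   0 */  1024,   820,   655,   526,   423,
        /*   5 */   335,   272,   215,   172,   137,
        /*  10 */   110,    87,    70,    56,    45,
        /*  15 */    36,    29,    23,    18,    15,
    };

    static double cpu_share(int nice_a, int nice_b)
    {
        int wa = prio_to_weight[nice_a + 20];
        int wb = prio_to_weight[nice_b + 20];

        return 100.0 * wa / (wa + wb);   /* share of task A, in percent */
    }

    int main(void)
    {
        /* both pairs are one nice level apart: the split is ~55%/45% */
        printf("nice -5 vs -4:   %.1f%%\n", cpu_share(-5, -4));
        printf("nice +10 vs +11: %.1f%%\n", cpu_share(10, 11));
        return 0;
    }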
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 17249fae5014..682ef87da6eb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -139,7 +139,7 @@ struct cfs_rq;
139extern void proc_sched_show_task(struct task_struct *p, struct seq_file *m); 139extern void proc_sched_show_task(struct task_struct *p, struct seq_file *m);
140extern void proc_sched_set_task(struct task_struct *p); 140extern void proc_sched_set_task(struct task_struct *p);
141extern void 141extern void
142print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq, u64 now); 142print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
143#else 143#else
144static inline void 144static inline void
145proc_sched_show_task(struct task_struct *p, struct seq_file *m) 145proc_sched_show_task(struct task_struct *p, struct seq_file *m)
@@ -149,7 +149,7 @@ static inline void proc_sched_set_task(struct task_struct *p)
149{ 149{
150} 150}
151static inline void 151static inline void
152print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq, u64 now) 152print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
153{ 153{
154} 154}
155#endif 155#endif
@@ -855,26 +855,24 @@ struct sched_domain;
855struct sched_class { 855struct sched_class {
856 struct sched_class *next; 856 struct sched_class *next;
857 857
858 void (*enqueue_task) (struct rq *rq, struct task_struct *p, 858 void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
859 int wakeup, u64 now); 859 void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
860 void (*dequeue_task) (struct rq *rq, struct task_struct *p,
861 int sleep, u64 now);
862 void (*yield_task) (struct rq *rq, struct task_struct *p); 860 void (*yield_task) (struct rq *rq, struct task_struct *p);
863 861
864 void (*check_preempt_curr) (struct rq *rq, struct task_struct *p); 862 void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
865 863
866 struct task_struct * (*pick_next_task) (struct rq *rq, u64 now); 864 struct task_struct * (*pick_next_task) (struct rq *rq);
867 void (*put_prev_task) (struct rq *rq, struct task_struct *p, u64 now); 865 void (*put_prev_task) (struct rq *rq, struct task_struct *p);
868 866
869 int (*load_balance) (struct rq *this_rq, int this_cpu, 867 unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
870 struct rq *busiest, 868 struct rq *busiest,
871 unsigned long max_nr_move, unsigned long max_load_move, 869 unsigned long max_nr_move, unsigned long max_load_move,
872 struct sched_domain *sd, enum cpu_idle_type idle, 870 struct sched_domain *sd, enum cpu_idle_type idle,
873 int *all_pinned, unsigned long *total_load_moved); 871 int *all_pinned, int *this_best_prio);
874 872
875 void (*set_curr_task) (struct rq *rq); 873 void (*set_curr_task) (struct rq *rq);
876 void (*task_tick) (struct rq *rq, struct task_struct *p); 874 void (*task_tick) (struct rq *rq, struct task_struct *p);
877 void (*task_new) (struct rq *rq, struct task_struct *p, u64 now); 875 void (*task_new) (struct rq *rq, struct task_struct *p);
878}; 876};
879 877
880struct load_weight { 878struct load_weight {
diff --git a/kernel/sched.c b/kernel/sched.c
index 72bb9483d949..b0afd8db1396 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -318,15 +318,19 @@ static inline int cpu_of(struct rq *rq)
318} 318}
319 319
320/* 320/*
321 * Per-runqueue clock, as finegrained as the platform can give us: 321 * Update the per-runqueue clock, as finegrained as the platform can give
322 * us, but without assuming monotonicity, etc.:
322 */ 323 */
323static unsigned long long __rq_clock(struct rq *rq) 324static void __update_rq_clock(struct rq *rq)
324{ 325{
325 u64 prev_raw = rq->prev_clock_raw; 326 u64 prev_raw = rq->prev_clock_raw;
326 u64 now = sched_clock(); 327 u64 now = sched_clock();
327 s64 delta = now - prev_raw; 328 s64 delta = now - prev_raw;
328 u64 clock = rq->clock; 329 u64 clock = rq->clock;
329 330
331#ifdef CONFIG_SCHED_DEBUG
332 WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
333#endif
330 /* 334 /*
331 * Protect against sched_clock() occasionally going backwards: 335 * Protect against sched_clock() occasionally going backwards:
332 */ 336 */
@@ -349,18 +353,12 @@ static unsigned long long __rq_clock(struct rq *rq)
349 353
350 rq->prev_clock_raw = now; 354 rq->prev_clock_raw = now;
351 rq->clock = clock; 355 rq->clock = clock;
352
353 return clock;
354} 356}
355 357
356static inline unsigned long long rq_clock(struct rq *rq) 358static void update_rq_clock(struct rq *rq)
357{ 359{
358 int this_cpu = smp_processor_id(); 360 if (likely(smp_processor_id() == cpu_of(rq)))
359 361 __update_rq_clock(rq);
360 if (this_cpu == cpu_of(rq))
361 return __rq_clock(rq);
362
363 return rq->clock;
364} 362}
365 363
366/* 364/*
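The hunks above replace the rq_clock()/__rq_clock() readers with update_rq_clock()/__update_rq_clock() calls that advance rq->clock in place, which is what lets the 'u64 now' parameters disappear throughout the rest of the diff. A simplified user-space sketch of the monotonicity guard (the handling of large forward jumps in the real function is elided from this hunk and from the sketch; the names and types are stand-ins):

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    struct rq_clock_sketch {
        uint64_t prev_clock_raw;
        uint64_t clock;
    };

    /* stand-in for the kernel's sched_clock(): raw monotonic nanoseconds */
    static uint64_t raw_clock(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    static void update_rq_clock_sketch(struct rq_clock_sketch *rq)
    {
        uint64_t now = raw_clock();
        int64_t delta = now - rq->prev_clock_raw;

        /* protect against the raw clock occasionally going backwards */
        if (delta > 0)
            rq->clock += delta;

        rq->prev_clock_raw = now;
    }

    int main(void)
    {
        struct rq_clock_sketch rq = { raw_clock(), 0 };

        update_rq_clock_sketch(&rq);
        printf("rq.clock advanced to %llu ns\n", (unsigned long long)rq.clock);
        return 0;
    }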
@@ -386,9 +384,12 @@ unsigned long long cpu_clock(int cpu)
386{ 384{
387 unsigned long long now; 385 unsigned long long now;
388 unsigned long flags; 386 unsigned long flags;
387 struct rq *rq;
389 388
390 local_irq_save(flags); 389 local_irq_save(flags);
391 now = rq_clock(cpu_rq(cpu)); 390 rq = cpu_rq(cpu);
391 update_rq_clock(rq);
392 now = rq->clock;
392 local_irq_restore(flags); 393 local_irq_restore(flags);
393 394
394 return now; 395 return now;
@@ -637,6 +638,11 @@ static u64 div64_likely32(u64 divident, unsigned long divisor)
637 638
638#define WMULT_SHIFT 32 639#define WMULT_SHIFT 32
639 640
641/*
642 * Shift right and round:
643 */
644#define RSR(x, y) (((x) + (1UL << ((y) - 1))) >> (y))
645
640static unsigned long 646static unsigned long
641calc_delta_mine(unsigned long delta_exec, unsigned long weight, 647calc_delta_mine(unsigned long delta_exec, unsigned long weight,
642 struct load_weight *lw) 648 struct load_weight *lw)
@@ -644,18 +650,17 @@ calc_delta_mine(unsigned long delta_exec, unsigned long weight,
644 u64 tmp; 650 u64 tmp;
645 651
646 if (unlikely(!lw->inv_weight)) 652 if (unlikely(!lw->inv_weight))
647 lw->inv_weight = WMULT_CONST / lw->weight; 653 lw->inv_weight = (WMULT_CONST - lw->weight/2) / lw->weight + 1;
648 654
649 tmp = (u64)delta_exec * weight; 655 tmp = (u64)delta_exec * weight;
650 /* 656 /*
651 * Check whether we'd overflow the 64-bit multiplication: 657 * Check whether we'd overflow the 64-bit multiplication:
652 */ 658 */
653 if (unlikely(tmp > WMULT_CONST)) { 659 if (unlikely(tmp > WMULT_CONST))
654 tmp = ((tmp >> WMULT_SHIFT/2) * lw->inv_weight) 660 tmp = RSR(RSR(tmp, WMULT_SHIFT/2) * lw->inv_weight,
655 >> (WMULT_SHIFT/2); 661 WMULT_SHIFT/2);
656 } else { 662 else
657 tmp = (tmp * lw->inv_weight) >> WMULT_SHIFT; 663 tmp = RSR(tmp * lw->inv_weight, WMULT_SHIFT);
658 }
659 664
660 return (unsigned long)min(tmp, (u64)(unsigned long)LONG_MAX); 665 return (unsigned long)min(tmp, (u64)(unsigned long)LONG_MAX);
661} 666}
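The new RSR() helper corresponds to the "sched: round a bit better" entry in the shortlog: calc_delta_mine() now rounds to nearest at each shift instead of truncating. A tiny user-space illustration with arbitrary values (not part of the patch):

    #include <stdio.h>

    /* shift right and round: add half of 2^y before shifting */
    #define RSR(x, y) (((x) + (1UL << ((y) - 1))) >> (y))

    int main(void)
    {
        unsigned long x = 1535;                 /* 1535 / 1024 = 1.499... */

        printf("truncated: %lu\n", x >> 10);    /* prints 1 */
        printf("rounded:   %lu\n", RSR(x, 10)); /* prints 1 */

        x = 1536;                               /* 1536 / 1024 = 1.5 */
        printf("truncated: %lu\n", x >> 10);    /* prints 1 */
        printf("rounded:   %lu\n", RSR(x, 10)); /* prints 2 */
        return 0;
    }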
@@ -703,11 +708,14 @@ static void update_load_sub(struct load_weight *lw, unsigned long dec)
703 * the relative distance between them is ~25%.) 708 * the relative distance between them is ~25%.)
704 */ 709 */
705static const int prio_to_weight[40] = { 710static const int prio_to_weight[40] = {
706/* -20 */ 88818, 71054, 56843, 45475, 36380, 29104, 23283, 18626, 14901, 11921, 711 /* -20 */ 88761, 71755, 56483, 46273, 36291,
707/* -10 */ 9537, 7629, 6103, 4883, 3906, 3125, 2500, 2000, 1600, 1280, 712 /* -15 */ 29154, 23254, 18705, 14949, 11916,
708/* 0 */ NICE_0_LOAD /* 1024 */, 713 /* -10 */ 9548, 7620, 6100, 4904, 3906,
709/* 1 */ 819, 655, 524, 419, 336, 268, 215, 172, 137, 714 /* -5 */ 3121, 2501, 1991, 1586, 1277,
710/* 10 */ 110, 87, 70, 56, 45, 36, 29, 23, 18, 15, 715 /* 0 */ 1024, 820, 655, 526, 423,
716 /* 5 */ 335, 272, 215, 172, 137,
717 /* 10 */ 110, 87, 70, 56, 45,
718 /* 15 */ 36, 29, 23, 18, 15,
711}; 719};
712 720
713/* 721/*
@@ -718,14 +726,14 @@ static const int prio_to_weight[40] = {
718 * into multiplications: 726 * into multiplications:
719 */ 727 */
720static const u32 prio_to_wmult[40] = { 728static const u32 prio_to_wmult[40] = {
721/* -20 */ 48356, 60446, 75558, 94446, 118058, 729 /* -20 */ 48388, 59856, 76040, 92818, 118348,
722/* -15 */ 147573, 184467, 230589, 288233, 360285, 730 /* -15 */ 147320, 184698, 229616, 287308, 360437,
723/* -10 */ 450347, 562979, 703746, 879575, 1099582, 731 /* -10 */ 449829, 563644, 704093, 875809, 1099582,
724/* -5 */ 1374389, 1717986, 2147483, 2684354, 3355443, 732 /* -5 */ 1376151, 1717300, 2157191, 2708050, 3363326,
725/* 0 */ 4194304, 5244160, 6557201, 8196502, 10250518, 733 /* 0 */ 4194304, 5237765, 6557202, 8165337, 10153587,
726/* 5 */ 12782640, 16025997, 19976592, 24970740, 31350126, 734 /* 5 */ 12820798, 15790321, 19976592, 24970740, 31350126,
727/* 10 */ 39045157, 49367440, 61356675, 76695844, 95443717, 735 /* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,
728/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153, 736 /* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
729}; 737};
730 738
731static void activate_task(struct rq *rq, struct task_struct *p, int wakeup); 739static void activate_task(struct rq *rq, struct task_struct *p, int wakeup);
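The regenerated tables are tied together by WMULT_SHIFT = 32: each prio_to_wmult[] entry is approximately 2^32 divided by the corresponding prio_to_weight[] entry, so a division by a task's weight can be done as a multiply plus a shift in calc_delta_mine(). A quick user-space check (illustration only, values copied from the tables above):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* a few (weight, wmult) pairs copied from the tables above */
        static const struct { unsigned int weight, wmult; } t[] = {
            { 88761,     48388 },   /* nice -20 */
            {  1024,   4194304 },   /* nice   0 */
            {    15, 286331153 },   /* nice +19 */
        };

        for (unsigned int i = 0; i < 3; i++) {
            uint64_t approx = (1ull << 32) / t[i].weight;

            printf("weight %u: table %u, 2^32/weight %llu\n",
                   t[i].weight, t[i].wmult, (unsigned long long)approx);
        }
        return 0;
    }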
@@ -745,8 +753,7 @@ static int balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
745 unsigned long max_nr_move, unsigned long max_load_move, 753 unsigned long max_nr_move, unsigned long max_load_move,
746 struct sched_domain *sd, enum cpu_idle_type idle, 754 struct sched_domain *sd, enum cpu_idle_type idle,
747 int *all_pinned, unsigned long *load_moved, 755 int *all_pinned, unsigned long *load_moved,
748 int this_best_prio, int best_prio, int best_prio_seen, 756 int *this_best_prio, struct rq_iterator *iterator);
749 struct rq_iterator *iterator);
750 757
751#include "sched_stats.h" 758#include "sched_stats.h"
752#include "sched_rt.c" 759#include "sched_rt.c"
@@ -782,14 +789,14 @@ static void __update_curr_load(struct rq *rq, struct load_stat *ls)
782 * This function is called /before/ updating rq->ls.load 789 * This function is called /before/ updating rq->ls.load
783 * and when switching tasks. 790 * and when switching tasks.
784 */ 791 */
785static void update_curr_load(struct rq *rq, u64 now) 792static void update_curr_load(struct rq *rq)
786{ 793{
787 struct load_stat *ls = &rq->ls; 794 struct load_stat *ls = &rq->ls;
788 u64 start; 795 u64 start;
789 796
790 start = ls->load_update_start; 797 start = ls->load_update_start;
791 ls->load_update_start = now; 798 ls->load_update_start = rq->clock;
792 ls->delta_stat += now - start; 799 ls->delta_stat += rq->clock - start;
793 /* 800 /*
794 * Stagger updates to ls->delta_fair. Very frequent updates 801 * Stagger updates to ls->delta_fair. Very frequent updates
795 * can be expensive. 802 * can be expensive.
@@ -798,30 +805,28 @@ static void update_curr_load(struct rq *rq, u64 now)
798 __update_curr_load(rq, ls); 805 __update_curr_load(rq, ls);
799} 806}
800 807
801static inline void 808static inline void inc_load(struct rq *rq, const struct task_struct *p)
802inc_load(struct rq *rq, const struct task_struct *p, u64 now)
803{ 809{
804 update_curr_load(rq, now); 810 update_curr_load(rq);
805 update_load_add(&rq->ls.load, p->se.load.weight); 811 update_load_add(&rq->ls.load, p->se.load.weight);
806} 812}
807 813
808static inline void 814static inline void dec_load(struct rq *rq, const struct task_struct *p)
809dec_load(struct rq *rq, const struct task_struct *p, u64 now)
810{ 815{
811 update_curr_load(rq, now); 816 update_curr_load(rq);
812 update_load_sub(&rq->ls.load, p->se.load.weight); 817 update_load_sub(&rq->ls.load, p->se.load.weight);
813} 818}
814 819
815static void inc_nr_running(struct task_struct *p, struct rq *rq, u64 now) 820static void inc_nr_running(struct task_struct *p, struct rq *rq)
816{ 821{
817 rq->nr_running++; 822 rq->nr_running++;
818 inc_load(rq, p, now); 823 inc_load(rq, p);
819} 824}
820 825
821static void dec_nr_running(struct task_struct *p, struct rq *rq, u64 now) 826static void dec_nr_running(struct task_struct *p, struct rq *rq)
822{ 827{
823 rq->nr_running--; 828 rq->nr_running--;
824 dec_load(rq, p, now); 829 dec_load(rq, p);
825} 830}
826 831
827static void set_load_weight(struct task_struct *p) 832static void set_load_weight(struct task_struct *p)
@@ -848,18 +853,16 @@ static void set_load_weight(struct task_struct *p)
848 p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO]; 853 p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
849} 854}
850 855
851static void 856static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
852enqueue_task(struct rq *rq, struct task_struct *p, int wakeup, u64 now)
853{ 857{
854 sched_info_queued(p); 858 sched_info_queued(p);
855 p->sched_class->enqueue_task(rq, p, wakeup, now); 859 p->sched_class->enqueue_task(rq, p, wakeup);
856 p->se.on_rq = 1; 860 p->se.on_rq = 1;
857} 861}
858 862
859static void 863static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
860dequeue_task(struct rq *rq, struct task_struct *p, int sleep, u64 now)
861{ 864{
862 p->sched_class->dequeue_task(rq, p, sleep, now); 865 p->sched_class->dequeue_task(rq, p, sleep);
863 p->se.on_rq = 0; 866 p->se.on_rq = 0;
864} 867}
865 868
@@ -914,13 +917,11 @@ static int effective_prio(struct task_struct *p)
914 */ 917 */
915static void activate_task(struct rq *rq, struct task_struct *p, int wakeup) 918static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
916{ 919{
917 u64 now = rq_clock(rq);
918
919 if (p->state == TASK_UNINTERRUPTIBLE) 920 if (p->state == TASK_UNINTERRUPTIBLE)
920 rq->nr_uninterruptible--; 921 rq->nr_uninterruptible--;
921 922
922 enqueue_task(rq, p, wakeup, now); 923 enqueue_task(rq, p, wakeup);
923 inc_nr_running(p, rq, now); 924 inc_nr_running(p, rq);
924} 925}
925 926
926/* 927/*
@@ -928,13 +929,13 @@ static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
928 */ 929 */
929static inline void activate_idle_task(struct task_struct *p, struct rq *rq) 930static inline void activate_idle_task(struct task_struct *p, struct rq *rq)
930{ 931{
931 u64 now = rq_clock(rq); 932 update_rq_clock(rq);
932 933
933 if (p->state == TASK_UNINTERRUPTIBLE) 934 if (p->state == TASK_UNINTERRUPTIBLE)
934 rq->nr_uninterruptible--; 935 rq->nr_uninterruptible--;
935 936
936 enqueue_task(rq, p, 0, now); 937 enqueue_task(rq, p, 0);
937 inc_nr_running(p, rq, now); 938 inc_nr_running(p, rq);
938} 939}
939 940
940/* 941/*
@@ -942,13 +943,11 @@ static inline void activate_idle_task(struct task_struct *p, struct rq *rq)
942 */ 943 */
943static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep) 944static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep)
944{ 945{
945 u64 now = rq_clock(rq);
946
947 if (p->state == TASK_UNINTERRUPTIBLE) 946 if (p->state == TASK_UNINTERRUPTIBLE)
948 rq->nr_uninterruptible++; 947 rq->nr_uninterruptible++;
949 948
950 dequeue_task(rq, p, sleep, now); 949 dequeue_task(rq, p, sleep);
951 dec_nr_running(p, rq, now); 950 dec_nr_running(p, rq);
952} 951}
953 952
954/** 953/**
@@ -1516,6 +1515,7 @@ out_set_cpu:
1516 1515
1517out_activate: 1516out_activate:
1518#endif /* CONFIG_SMP */ 1517#endif /* CONFIG_SMP */
1518 update_rq_clock(rq);
1519 activate_task(rq, p, 1); 1519 activate_task(rq, p, 1);
1520 /* 1520 /*
1521 * Sync wakeups (i.e. those types of wakeups where the waker 1521 * Sync wakeups (i.e. those types of wakeups where the waker
@@ -1647,12 +1647,11 @@ void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
1647 unsigned long flags; 1647 unsigned long flags;
1648 struct rq *rq; 1648 struct rq *rq;
1649 int this_cpu; 1649 int this_cpu;
1650 u64 now;
1651 1650
1652 rq = task_rq_lock(p, &flags); 1651 rq = task_rq_lock(p, &flags);
1653 BUG_ON(p->state != TASK_RUNNING); 1652 BUG_ON(p->state != TASK_RUNNING);
1654 this_cpu = smp_processor_id(); /* parent's CPU */ 1653 this_cpu = smp_processor_id(); /* parent's CPU */
1655 now = rq_clock(rq); 1654 update_rq_clock(rq);
1656 1655
1657 p->prio = effective_prio(p); 1656 p->prio = effective_prio(p);
1658 1657
@@ -1666,8 +1665,8 @@ void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
1666 * Let the scheduling class do new task startup 1665 * Let the scheduling class do new task startup
1667 * management (if any): 1666 * management (if any):
1668 */ 1667 */
1669 p->sched_class->task_new(rq, p, now); 1668 p->sched_class->task_new(rq, p);
1670 inc_nr_running(p, rq, now); 1669 inc_nr_running(p, rq);
1671 } 1670 }
1672 check_preempt_curr(rq, p); 1671 check_preempt_curr(rq, p);
1673 task_rq_unlock(rq, &flags); 1672 task_rq_unlock(rq, &flags);
@@ -1954,7 +1953,6 @@ static void update_cpu_load(struct rq *this_rq)
1954 unsigned long total_load = this_rq->ls.load.weight; 1953 unsigned long total_load = this_rq->ls.load.weight;
1955 unsigned long this_load = total_load; 1954 unsigned long this_load = total_load;
1956 struct load_stat *ls = &this_rq->ls; 1955 struct load_stat *ls = &this_rq->ls;
1957 u64 now = __rq_clock(this_rq);
1958 int i, scale; 1956 int i, scale;
1959 1957
1960 this_rq->nr_load_updates++; 1958 this_rq->nr_load_updates++;
@@ -1962,7 +1960,7 @@ static void update_cpu_load(struct rq *this_rq)
1962 goto do_avg; 1960 goto do_avg;
1963 1961
1964 /* Update delta_fair/delta_exec fields first */ 1962 /* Update delta_fair/delta_exec fields first */
1965 update_curr_load(this_rq, now); 1963 update_curr_load(this_rq);
1966 1964
1967 fair_delta64 = ls->delta_fair + 1; 1965 fair_delta64 = ls->delta_fair + 1;
1968 ls->delta_fair = 0; 1966 ls->delta_fair = 0;
@@ -1970,8 +1968,8 @@ static void update_cpu_load(struct rq *this_rq)
1970 exec_delta64 = ls->delta_exec + 1; 1968 exec_delta64 = ls->delta_exec + 1;
1971 ls->delta_exec = 0; 1969 ls->delta_exec = 0;
1972 1970
1973 sample_interval64 = now - ls->load_update_last; 1971 sample_interval64 = this_rq->clock - ls->load_update_last;
1974 ls->load_update_last = now; 1972 ls->load_update_last = this_rq->clock;
1975 1973
1976 if ((s64)sample_interval64 < (s64)TICK_NSEC) 1974 if ((s64)sample_interval64 < (s64)TICK_NSEC)
1977 sample_interval64 = TICK_NSEC; 1975 sample_interval64 = TICK_NSEC;
@@ -2026,6 +2024,8 @@ static void double_rq_lock(struct rq *rq1, struct rq *rq2)
2026 spin_lock(&rq1->lock); 2024 spin_lock(&rq1->lock);
2027 } 2025 }
2028 } 2026 }
2027 update_rq_clock(rq1);
2028 update_rq_clock(rq2);
2029} 2029}
2030 2030
2031/* 2031/*
@@ -2166,8 +2166,7 @@ static int balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
2166 unsigned long max_nr_move, unsigned long max_load_move, 2166 unsigned long max_nr_move, unsigned long max_load_move,
2167 struct sched_domain *sd, enum cpu_idle_type idle, 2167 struct sched_domain *sd, enum cpu_idle_type idle,
2168 int *all_pinned, unsigned long *load_moved, 2168 int *all_pinned, unsigned long *load_moved,
2169 int this_best_prio, int best_prio, int best_prio_seen, 2169 int *this_best_prio, struct rq_iterator *iterator)
2170 struct rq_iterator *iterator)
2171{ 2170{
2172 int pulled = 0, pinned = 0, skip_for_load; 2171 int pulled = 0, pinned = 0, skip_for_load;
2173 struct task_struct *p; 2172 struct task_struct *p;
@@ -2192,12 +2191,8 @@ next:
2192 */ 2191 */
2193 skip_for_load = (p->se.load.weight >> 1) > rem_load_move + 2192 skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
2194 SCHED_LOAD_SCALE_FUZZ; 2193 SCHED_LOAD_SCALE_FUZZ;
2195 if (skip_for_load && p->prio < this_best_prio) 2194 if ((skip_for_load && p->prio >= *this_best_prio) ||
2196 skip_for_load = !best_prio_seen && p->prio == best_prio;
2197 if (skip_for_load ||
2198 !can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) { 2195 !can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) {
2199
2200 best_prio_seen |= p->prio == best_prio;
2201 p = iterator->next(iterator->arg); 2196 p = iterator->next(iterator->arg);
2202 goto next; 2197 goto next;
2203 } 2198 }
@@ -2211,8 +2206,8 @@ next:
2211 * and the prescribed amount of weighted load. 2206 * and the prescribed amount of weighted load.
2212 */ 2207 */
2213 if (pulled < max_nr_move && rem_load_move > 0) { 2208 if (pulled < max_nr_move && rem_load_move > 0) {
2214 if (p->prio < this_best_prio) 2209 if (p->prio < *this_best_prio)
2215 this_best_prio = p->prio; 2210 *this_best_prio = p->prio;
2216 p = iterator->next(iterator->arg); 2211 p = iterator->next(iterator->arg);
2217 goto next; 2212 goto next;
2218 } 2213 }
@@ -2231,32 +2226,52 @@ out:
2231} 2226}
2232 2227
2233/* 2228/*
2234 * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted 2229 * move_tasks tries to move up to max_load_move weighted load from busiest to
2235 * load from busiest to this_rq, as part of a balancing operation within 2230 * this_rq, as part of a balancing operation within domain "sd".
2236 * "domain". Returns the number of tasks moved. 2231 * Returns 1 if successful and 0 otherwise.
2237 * 2232 *
2238 * Called with both runqueues locked. 2233 * Called with both runqueues locked.
2239 */ 2234 */
2240static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest, 2235static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
2241 unsigned long max_nr_move, unsigned long max_load_move, 2236 unsigned long max_load_move,
2242 struct sched_domain *sd, enum cpu_idle_type idle, 2237 struct sched_domain *sd, enum cpu_idle_type idle,
2243 int *all_pinned) 2238 int *all_pinned)
2244{ 2239{
2245 struct sched_class *class = sched_class_highest; 2240 struct sched_class *class = sched_class_highest;
2246 unsigned long load_moved, total_nr_moved = 0, nr_moved; 2241 unsigned long total_load_moved = 0;
2247 long rem_load_move = max_load_move; 2242 int this_best_prio = this_rq->curr->prio;
2248 2243
2249 do { 2244 do {
2250 nr_moved = class->load_balance(this_rq, this_cpu, busiest, 2245 total_load_moved +=
2251 max_nr_move, (unsigned long)rem_load_move, 2246 class->load_balance(this_rq, this_cpu, busiest,
2252 sd, idle, all_pinned, &load_moved); 2247 ULONG_MAX, max_load_move - total_load_moved,
2253 total_nr_moved += nr_moved; 2248 sd, idle, all_pinned, &this_best_prio);
2254 max_nr_move -= nr_moved;
2255 rem_load_move -= load_moved;
2256 class = class->next; 2249 class = class->next;
2257 } while (class && max_nr_move && rem_load_move > 0); 2250 } while (class && max_load_move > total_load_moved);
2258 2251
2259 return total_nr_moved; 2252 return total_load_moved > 0;
2253}
2254
2255/*
2256 * move_one_task tries to move exactly one task from busiest to this_rq, as
2257 * part of active balancing operations within "domain".
2258 * Returns 1 if successful and 0 otherwise.
2259 *
2260 * Called with both runqueues locked.
2261 */
2262static int move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
2263 struct sched_domain *sd, enum cpu_idle_type idle)
2264{
2265 struct sched_class *class;
2266 int this_best_prio = MAX_PRIO;
2267
2268 for (class = sched_class_highest; class; class = class->next)
2269 if (class->load_balance(this_rq, this_cpu, busiest,
2270 1, ULONG_MAX, sd, idle, NULL,
2271 &this_best_prio))
2272 return 1;
2273
2274 return 0;
2260} 2275}
2261 2276
2262/* 2277/*
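move_tasks() now walks the singly linked list of scheduling classes, letting each class move weighted load until the requested amount has been reached, and reports a boolean instead of a task count; the new move_one_task() covers the active-balancing case separately. A user-space sketch of that control flow with stand-in types (illustration only, not the kernel's structures):

    #include <stdio.h>

    struct class_sketch {
        const char *name;
        unsigned long movable;      /* load this class could move (stand-in) */
        struct class_sketch *next;  /* next lower-priority class */
    };

    static int move_tasks_sketch(struct class_sketch *highest,
                                 unsigned long max_load_move)
    {
        struct class_sketch *class = highest;
        unsigned long total_load_moved = 0;

        do {
            unsigned long want = max_load_move - total_load_moved;
            unsigned long got = class->movable < want ? class->movable : want;

            total_load_moved += got;
            class = class->next;
        } while (class && max_load_move > total_load_moved);

        return total_load_moved > 0;
    }

    int main(void)
    {
        struct class_sketch fair = { "fair", 300, NULL };
        struct class_sketch rt   = { "rt",   100, &fair };

        printf("moved anything: %d\n", move_tasks_sketch(&rt, 250));
        return 0;
    }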
@@ -2588,11 +2603,6 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
2588 */ 2603 */
2589#define MAX_PINNED_INTERVAL 512 2604#define MAX_PINNED_INTERVAL 512
2590 2605
2591static inline unsigned long minus_1_or_zero(unsigned long n)
2592{
2593 return n > 0 ? n - 1 : 0;
2594}
2595
2596/* 2606/*
2597 * Check this_cpu to ensure it is balanced within domain. Attempt to move 2607 * Check this_cpu to ensure it is balanced within domain. Attempt to move
2598 * tasks if there is an imbalance. 2608 * tasks if there is an imbalance.
@@ -2601,7 +2611,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
2601 struct sched_domain *sd, enum cpu_idle_type idle, 2611 struct sched_domain *sd, enum cpu_idle_type idle,
2602 int *balance) 2612 int *balance)
2603{ 2613{
2604 int nr_moved, all_pinned = 0, active_balance = 0, sd_idle = 0; 2614 int ld_moved, all_pinned = 0, active_balance = 0, sd_idle = 0;
2605 struct sched_group *group; 2615 struct sched_group *group;
2606 unsigned long imbalance; 2616 unsigned long imbalance;
2607 struct rq *busiest; 2617 struct rq *busiest;
@@ -2642,18 +2652,17 @@ redo:
2642 2652
2643 schedstat_add(sd, lb_imbalance[idle], imbalance); 2653 schedstat_add(sd, lb_imbalance[idle], imbalance);
2644 2654
2645 nr_moved = 0; 2655 ld_moved = 0;
2646 if (busiest->nr_running > 1) { 2656 if (busiest->nr_running > 1) {
2647 /* 2657 /*
2648 * Attempt to move tasks. If find_busiest_group has found 2658 * Attempt to move tasks. If find_busiest_group has found
2649 * an imbalance but busiest->nr_running <= 1, the group is 2659 * an imbalance but busiest->nr_running <= 1, the group is
2650 * still unbalanced. nr_moved simply stays zero, so it is 2660 * still unbalanced. ld_moved simply stays zero, so it is
2651 * correctly treated as an imbalance. 2661 * correctly treated as an imbalance.
2652 */ 2662 */
2653 local_irq_save(flags); 2663 local_irq_save(flags);
2654 double_rq_lock(this_rq, busiest); 2664 double_rq_lock(this_rq, busiest);
2655 nr_moved = move_tasks(this_rq, this_cpu, busiest, 2665 ld_moved = move_tasks(this_rq, this_cpu, busiest,
2656 minus_1_or_zero(busiest->nr_running),
2657 imbalance, sd, idle, &all_pinned); 2666 imbalance, sd, idle, &all_pinned);
2658 double_rq_unlock(this_rq, busiest); 2667 double_rq_unlock(this_rq, busiest);
2659 local_irq_restore(flags); 2668 local_irq_restore(flags);
@@ -2661,7 +2670,7 @@ redo:
2661 /* 2670 /*
2662 * some other cpu did the load balance for us. 2671 * some other cpu did the load balance for us.
2663 */ 2672 */
2664 if (nr_moved && this_cpu != smp_processor_id()) 2673 if (ld_moved && this_cpu != smp_processor_id())
2665 resched_cpu(this_cpu); 2674 resched_cpu(this_cpu);
2666 2675
2667 /* All tasks on this runqueue were pinned by CPU affinity */ 2676 /* All tasks on this runqueue were pinned by CPU affinity */
@@ -2673,7 +2682,7 @@ redo:
2673 } 2682 }
2674 } 2683 }
2675 2684
2676 if (!nr_moved) { 2685 if (!ld_moved) {
2677 schedstat_inc(sd, lb_failed[idle]); 2686 schedstat_inc(sd, lb_failed[idle]);
2678 sd->nr_balance_failed++; 2687 sd->nr_balance_failed++;
2679 2688
@@ -2722,10 +2731,10 @@ redo:
2722 sd->balance_interval *= 2; 2731 sd->balance_interval *= 2;
2723 } 2732 }
2724 2733
2725 if (!nr_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER && 2734 if (!ld_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
2726 !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE)) 2735 !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
2727 return -1; 2736 return -1;
2728 return nr_moved; 2737 return ld_moved;
2729 2738
2730out_balanced: 2739out_balanced:
2731 schedstat_inc(sd, lb_balanced[idle]); 2740 schedstat_inc(sd, lb_balanced[idle]);
@@ -2757,7 +2766,7 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd)
2757 struct sched_group *group; 2766 struct sched_group *group;
2758 struct rq *busiest = NULL; 2767 struct rq *busiest = NULL;
2759 unsigned long imbalance; 2768 unsigned long imbalance;
2760 int nr_moved = 0; 2769 int ld_moved = 0;
2761 int sd_idle = 0; 2770 int sd_idle = 0;
2762 int all_pinned = 0; 2771 int all_pinned = 0;
2763 cpumask_t cpus = CPU_MASK_ALL; 2772 cpumask_t cpus = CPU_MASK_ALL;
@@ -2792,12 +2801,13 @@ redo:
2792 2801
2793 schedstat_add(sd, lb_imbalance[CPU_NEWLY_IDLE], imbalance); 2802 schedstat_add(sd, lb_imbalance[CPU_NEWLY_IDLE], imbalance);
2794 2803
2795 nr_moved = 0; 2804 ld_moved = 0;
2796 if (busiest->nr_running > 1) { 2805 if (busiest->nr_running > 1) {
2797 /* Attempt to move tasks */ 2806 /* Attempt to move tasks */
2798 double_lock_balance(this_rq, busiest); 2807 double_lock_balance(this_rq, busiest);
2799 nr_moved = move_tasks(this_rq, this_cpu, busiest, 2808 /* this_rq->clock is already updated */
2800 minus_1_or_zero(busiest->nr_running), 2809 update_rq_clock(busiest);
2810 ld_moved = move_tasks(this_rq, this_cpu, busiest,
2801 imbalance, sd, CPU_NEWLY_IDLE, 2811 imbalance, sd, CPU_NEWLY_IDLE,
2802 &all_pinned); 2812 &all_pinned);
2803 spin_unlock(&busiest->lock); 2813 spin_unlock(&busiest->lock);
@@ -2809,7 +2819,7 @@ redo:
2809 } 2819 }
2810 } 2820 }
2811 2821
2812 if (!nr_moved) { 2822 if (!ld_moved) {
2813 schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]); 2823 schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
2814 if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER && 2824 if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
2815 !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE)) 2825 !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
@@ -2817,7 +2827,7 @@ redo:
2817 } else 2827 } else
2818 sd->nr_balance_failed = 0; 2828 sd->nr_balance_failed = 0;
2819 2829
2820 return nr_moved; 2830 return ld_moved;
2821 2831
2822out_balanced: 2832out_balanced:
2823 schedstat_inc(sd, lb_balanced[CPU_NEWLY_IDLE]); 2833 schedstat_inc(sd, lb_balanced[CPU_NEWLY_IDLE]);
@@ -2894,6 +2904,8 @@ static void active_load_balance(struct rq *busiest_rq, int busiest_cpu)
2894 2904
2895 /* move a task from busiest_rq to target_rq */ 2905 /* move a task from busiest_rq to target_rq */
2896 double_lock_balance(busiest_rq, target_rq); 2906 double_lock_balance(busiest_rq, target_rq);
2907 update_rq_clock(busiest_rq);
2908 update_rq_clock(target_rq);
2897 2909
2898 /* Search for an sd spanning us and the target CPU. */ 2910 /* Search for an sd spanning us and the target CPU. */
2899 for_each_domain(target_cpu, sd) { 2911 for_each_domain(target_cpu, sd) {
@@ -2905,8 +2917,8 @@ static void active_load_balance(struct rq *busiest_rq, int busiest_cpu)
2905 if (likely(sd)) { 2917 if (likely(sd)) {
2906 schedstat_inc(sd, alb_cnt); 2918 schedstat_inc(sd, alb_cnt);
2907 2919
2908 if (move_tasks(target_rq, target_cpu, busiest_rq, 1, 2920 if (move_one_task(target_rq, target_cpu, busiest_rq,
2909 ULONG_MAX, sd, CPU_IDLE, NULL)) 2921 sd, CPU_IDLE))
2910 schedstat_inc(sd, alb_pushed); 2922 schedstat_inc(sd, alb_pushed);
2911 else 2923 else
2912 schedstat_inc(sd, alb_failed); 2924 schedstat_inc(sd, alb_failed);
@@ -3175,8 +3187,7 @@ static int balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
3175 unsigned long max_nr_move, unsigned long max_load_move, 3187 unsigned long max_nr_move, unsigned long max_load_move,
3176 struct sched_domain *sd, enum cpu_idle_type idle, 3188 struct sched_domain *sd, enum cpu_idle_type idle,
3177 int *all_pinned, unsigned long *load_moved, 3189 int *all_pinned, unsigned long *load_moved,
3178 int this_best_prio, int best_prio, int best_prio_seen, 3190 int *this_best_prio, struct rq_iterator *iterator)
3179 struct rq_iterator *iterator)
3180{ 3191{
3181 *load_moved = 0; 3192 *load_moved = 0;
3182 3193
@@ -3202,7 +3213,8 @@ unsigned long long task_sched_runtime(struct task_struct *p)
3202 rq = task_rq_lock(p, &flags); 3213 rq = task_rq_lock(p, &flags);
3203 ns = p->se.sum_exec_runtime; 3214 ns = p->se.sum_exec_runtime;
3204 if (rq->curr == p) { 3215 if (rq->curr == p) {
3205 delta_exec = rq_clock(rq) - p->se.exec_start; 3216 update_rq_clock(rq);
3217 delta_exec = rq->clock - p->se.exec_start;
3206 if ((s64)delta_exec > 0) 3218 if ((s64)delta_exec > 0)
3207 ns += delta_exec; 3219 ns += delta_exec;
3208 } 3220 }
@@ -3298,9 +3310,10 @@ void scheduler_tick(void)
3298 struct task_struct *curr = rq->curr; 3310 struct task_struct *curr = rq->curr;
3299 3311
3300 spin_lock(&rq->lock); 3312 spin_lock(&rq->lock);
3313 __update_rq_clock(rq);
3314 update_cpu_load(rq);
3301 if (curr != rq->idle) /* FIXME: needed? */ 3315 if (curr != rq->idle) /* FIXME: needed? */
3302 curr->sched_class->task_tick(rq, curr); 3316 curr->sched_class->task_tick(rq, curr);
3303 update_cpu_load(rq);
3304 spin_unlock(&rq->lock); 3317 spin_unlock(&rq->lock);
3305 3318
3306#ifdef CONFIG_SMP 3319#ifdef CONFIG_SMP
@@ -3382,7 +3395,7 @@ static inline void schedule_debug(struct task_struct *prev)
3382 * Pick up the highest-prio task: 3395 * Pick up the highest-prio task:
3383 */ 3396 */
3384static inline struct task_struct * 3397static inline struct task_struct *
3385pick_next_task(struct rq *rq, struct task_struct *prev, u64 now) 3398pick_next_task(struct rq *rq, struct task_struct *prev)
3386{ 3399{
3387 struct sched_class *class; 3400 struct sched_class *class;
3388 struct task_struct *p; 3401 struct task_struct *p;
@@ -3392,14 +3405,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, u64 now)
3392 * the fair class we can call that function directly: 3405 * the fair class we can call that function directly:
3393 */ 3406 */
3394 if (likely(rq->nr_running == rq->cfs.nr_running)) { 3407 if (likely(rq->nr_running == rq->cfs.nr_running)) {
3395 p = fair_sched_class.pick_next_task(rq, now); 3408 p = fair_sched_class.pick_next_task(rq);
3396 if (likely(p)) 3409 if (likely(p))
3397 return p; 3410 return p;
3398 } 3411 }
3399 3412
3400 class = sched_class_highest; 3413 class = sched_class_highest;
3401 for ( ; ; ) { 3414 for ( ; ; ) {
3402 p = class->pick_next_task(rq, now); 3415 p = class->pick_next_task(rq);
3403 if (p) 3416 if (p)
3404 return p; 3417 return p;
3405 /* 3418 /*
@@ -3418,7 +3431,6 @@ asmlinkage void __sched schedule(void)
3418 struct task_struct *prev, *next; 3431 struct task_struct *prev, *next;
3419 long *switch_count; 3432 long *switch_count;
3420 struct rq *rq; 3433 struct rq *rq;
3421 u64 now;
3422 int cpu; 3434 int cpu;
3423 3435
3424need_resched: 3436need_resched:
@@ -3436,6 +3448,7 @@ need_resched_nonpreemptible:
3436 3448
3437 spin_lock_irq(&rq->lock); 3449 spin_lock_irq(&rq->lock);
3438 clear_tsk_need_resched(prev); 3450 clear_tsk_need_resched(prev);
3451 __update_rq_clock(rq);
3439 3452
3440 if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) { 3453 if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
3441 if (unlikely((prev->state & TASK_INTERRUPTIBLE) && 3454 if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
@@ -3450,9 +3463,8 @@ need_resched_nonpreemptible:
3450 if (unlikely(!rq->nr_running)) 3463 if (unlikely(!rq->nr_running))
3451 idle_balance(cpu, rq); 3464 idle_balance(cpu, rq);
3452 3465
3453 now = __rq_clock(rq); 3466 prev->sched_class->put_prev_task(rq, prev);
3454 prev->sched_class->put_prev_task(rq, prev, now); 3467 next = pick_next_task(rq, prev);
3455 next = pick_next_task(rq, prev, now);
3456 3468
3457 sched_info_switch(prev, next); 3469 sched_info_switch(prev, next);
3458 3470
@@ -3895,17 +3907,16 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
3895 unsigned long flags; 3907 unsigned long flags;
3896 int oldprio, on_rq; 3908 int oldprio, on_rq;
3897 struct rq *rq; 3909 struct rq *rq;
3898 u64 now;
3899 3910
3900 BUG_ON(prio < 0 || prio > MAX_PRIO); 3911 BUG_ON(prio < 0 || prio > MAX_PRIO);
3901 3912
3902 rq = task_rq_lock(p, &flags); 3913 rq = task_rq_lock(p, &flags);
3903 now = rq_clock(rq); 3914 update_rq_clock(rq);
3904 3915
3905 oldprio = p->prio; 3916 oldprio = p->prio;
3906 on_rq = p->se.on_rq; 3917 on_rq = p->se.on_rq;
3907 if (on_rq) 3918 if (on_rq)
3908 dequeue_task(rq, p, 0, now); 3919 dequeue_task(rq, p, 0);
3909 3920
3910 if (rt_prio(prio)) 3921 if (rt_prio(prio))
3911 p->sched_class = &rt_sched_class; 3922 p->sched_class = &rt_sched_class;
@@ -3915,7 +3926,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
3915 p->prio = prio; 3926 p->prio = prio;
3916 3927
3917 if (on_rq) { 3928 if (on_rq) {
3918 enqueue_task(rq, p, 0, now); 3929 enqueue_task(rq, p, 0);
3919 /* 3930 /*
3920 * Reschedule if we are currently running on this runqueue and 3931 * Reschedule if we are currently running on this runqueue and
3921 * our priority decreased, or if we are not currently running on 3932 * our priority decreased, or if we are not currently running on
@@ -3938,7 +3949,6 @@ void set_user_nice(struct task_struct *p, long nice)
3938 int old_prio, delta, on_rq; 3949 int old_prio, delta, on_rq;
3939 unsigned long flags; 3950 unsigned long flags;
3940 struct rq *rq; 3951 struct rq *rq;
3941 u64 now;
3942 3952
3943 if (TASK_NICE(p) == nice || nice < -20 || nice > 19) 3953 if (TASK_NICE(p) == nice || nice < -20 || nice > 19)
3944 return; 3954 return;
@@ -3947,7 +3957,7 @@ void set_user_nice(struct task_struct *p, long nice)
3947 * the task might be in the middle of scheduling on another CPU. 3957 * the task might be in the middle of scheduling on another CPU.
3948 */ 3958 */
3949 rq = task_rq_lock(p, &flags); 3959 rq = task_rq_lock(p, &flags);
3950 now = rq_clock(rq); 3960 update_rq_clock(rq);
3951 /* 3961 /*
3952 * The RT priorities are set via sched_setscheduler(), but we still 3962 * The RT priorities are set via sched_setscheduler(), but we still
3953 * allow the 'normal' nice value to be set - but as expected 3963 * allow the 'normal' nice value to be set - but as expected
@@ -3960,8 +3970,8 @@ void set_user_nice(struct task_struct *p, long nice)
3960 } 3970 }
3961 on_rq = p->se.on_rq; 3971 on_rq = p->se.on_rq;
3962 if (on_rq) { 3972 if (on_rq) {
3963 dequeue_task(rq, p, 0, now); 3973 dequeue_task(rq, p, 0);
3964 dec_load(rq, p, now); 3974 dec_load(rq, p);
3965 } 3975 }
3966 3976
3967 p->static_prio = NICE_TO_PRIO(nice); 3977 p->static_prio = NICE_TO_PRIO(nice);
@@ -3971,8 +3981,8 @@ void set_user_nice(struct task_struct *p, long nice)
3971 delta = p->prio - old_prio; 3981 delta = p->prio - old_prio;
3972 3982
3973 if (on_rq) { 3983 if (on_rq) {
3974 enqueue_task(rq, p, 0, now); 3984 enqueue_task(rq, p, 0);
3975 inc_load(rq, p, now); 3985 inc_load(rq, p);
3976 /* 3986 /*
3977 * If the task increased its priority or is running and 3987 * If the task increased its priority or is running and
3978 * lowered its priority, then reschedule its CPU: 3988 * lowered its priority, then reschedule its CPU:
@@ -4208,6 +4218,7 @@ recheck:
4208 spin_unlock_irqrestore(&p->pi_lock, flags); 4218 spin_unlock_irqrestore(&p->pi_lock, flags);
4209 goto recheck; 4219 goto recheck;
4210 } 4220 }
4221 update_rq_clock(rq);
4211 on_rq = p->se.on_rq; 4222 on_rq = p->se.on_rq;
4212 if (on_rq) 4223 if (on_rq)
4213 deactivate_task(rq, p, 0); 4224 deactivate_task(rq, p, 0);
@@ -4463,10 +4474,8 @@ long sched_getaffinity(pid_t pid, cpumask_t *mask)
4463out_unlock: 4474out_unlock:
4464 read_unlock(&tasklist_lock); 4475 read_unlock(&tasklist_lock);
4465 mutex_unlock(&sched_hotcpu_mutex); 4476 mutex_unlock(&sched_hotcpu_mutex);
4466 if (retval)
4467 return retval;
4468 4477
4469 return 0; 4478 return retval;
4470} 4479}
4471 4480
4472/** 4481/**
@@ -4966,6 +4975,7 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
4966 on_rq = p->se.on_rq; 4975 on_rq = p->se.on_rq;
4967 if (on_rq) 4976 if (on_rq)
4968 deactivate_task(rq_src, p, 0); 4977 deactivate_task(rq_src, p, 0);
4978
4969 set_task_cpu(p, dest_cpu); 4979 set_task_cpu(p, dest_cpu);
4970 if (on_rq) { 4980 if (on_rq) {
4971 activate_task(rq_dest, p, 0); 4981 activate_task(rq_dest, p, 0);
@@ -5198,7 +5208,8 @@ static void migrate_dead_tasks(unsigned int dead_cpu)
5198 for ( ; ; ) { 5208 for ( ; ; ) {
5199 if (!rq->nr_running) 5209 if (!rq->nr_running)
5200 break; 5210 break;
5201 next = pick_next_task(rq, rq->curr, rq_clock(rq)); 5211 update_rq_clock(rq);
5212 next = pick_next_task(rq, rq->curr);
5202 if (!next) 5213 if (!next)
5203 break; 5214 break;
5204 migrate_dead(dead_cpu, next); 5215 migrate_dead(dead_cpu, next);
@@ -5210,12 +5221,19 @@ static void migrate_dead_tasks(unsigned int dead_cpu)
5210#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL) 5221#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
5211 5222
5212static struct ctl_table sd_ctl_dir[] = { 5223static struct ctl_table sd_ctl_dir[] = {
5213 {CTL_UNNUMBERED, "sched_domain", NULL, 0, 0755, NULL, }, 5224 {
5225 .procname = "sched_domain",
5226 .mode = 0755,
5227 },
5214 {0,}, 5228 {0,},
5215}; 5229};
5216 5230
5217static struct ctl_table sd_ctl_root[] = { 5231static struct ctl_table sd_ctl_root[] = {
5218 {CTL_UNNUMBERED, "kernel", NULL, 0, 0755, sd_ctl_dir, }, 5232 {
5233 .procname = "kernel",
5234 .mode = 0755,
5235 .child = sd_ctl_dir,
5236 },
5219 {0,}, 5237 {0,},
5220}; 5238};
5221 5239
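The sched_domain sysctl tables switch from positional initializers (including the now-dropped ctl_name/CTL_UNNUMBERED field) to C99 designated initializers, which stay correct if fields are added to or reordered in struct ctl_table. A sketch of the idiom with stand-in types, not the kernel's real struct ctl_table:

    struct ctl_table_sketch {
        const char *procname;
        int mode;
        struct ctl_table_sketch *child;
    };

    static struct ctl_table_sketch sd_ctl_dir_sketch[] = {
        {
            .procname = "sched_domain",
            .mode     = 0755,
        },
        { }     /* table terminator */
    };

    static struct ctl_table_sketch sd_ctl_root_sketch[] = {
        {
            .procname = "kernel",
            .mode     = 0755,
            .child    = sd_ctl_dir_sketch,
        },
        { }
    };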
@@ -5231,11 +5249,10 @@ static struct ctl_table *sd_alloc_ctl_entry(int n)
5231} 5249}
5232 5250
5233static void 5251static void
5234set_table_entry(struct ctl_table *entry, int ctl_name, 5252set_table_entry(struct ctl_table *entry,
5235 const char *procname, void *data, int maxlen, 5253 const char *procname, void *data, int maxlen,
5236 mode_t mode, proc_handler *proc_handler) 5254 mode_t mode, proc_handler *proc_handler)
5237{ 5255{
5238 entry->ctl_name = ctl_name;
5239 entry->procname = procname; 5256 entry->procname = procname;
5240 entry->data = data; 5257 entry->data = data;
5241 entry->maxlen = maxlen; 5258 entry->maxlen = maxlen;
@@ -5248,28 +5265,28 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
5248{ 5265{
5249 struct ctl_table *table = sd_alloc_ctl_entry(14); 5266 struct ctl_table *table = sd_alloc_ctl_entry(14);
5250 5267
5251 set_table_entry(&table[0], 1, "min_interval", &sd->min_interval, 5268 set_table_entry(&table[0], "min_interval", &sd->min_interval,
5252 sizeof(long), 0644, proc_doulongvec_minmax); 5269 sizeof(long), 0644, proc_doulongvec_minmax);
5253 set_table_entry(&table[1], 2, "max_interval", &sd->max_interval, 5270 set_table_entry(&table[1], "max_interval", &sd->max_interval,
5254 sizeof(long), 0644, proc_doulongvec_minmax); 5271 sizeof(long), 0644, proc_doulongvec_minmax);
5255 set_table_entry(&table[2], 3, "busy_idx", &sd->busy_idx, 5272 set_table_entry(&table[2], "busy_idx", &sd->busy_idx,
5256 sizeof(int), 0644, proc_dointvec_minmax); 5273 sizeof(int), 0644, proc_dointvec_minmax);
5257 set_table_entry(&table[3], 4, "idle_idx", &sd->idle_idx, 5274 set_table_entry(&table[3], "idle_idx", &sd->idle_idx,
5258 sizeof(int), 0644, proc_dointvec_minmax); 5275 sizeof(int), 0644, proc_dointvec_minmax);
5259 set_table_entry(&table[4], 5, "newidle_idx", &sd->newidle_idx, 5276 set_table_entry(&table[4], "newidle_idx", &sd->newidle_idx,
5260 sizeof(int), 0644, proc_dointvec_minmax); 5277 sizeof(int), 0644, proc_dointvec_minmax);
5261 set_table_entry(&table[5], 6, "wake_idx", &sd->wake_idx, 5278 set_table_entry(&table[5], "wake_idx", &sd->wake_idx,
5262 sizeof(int), 0644, proc_dointvec_minmax); 5279 sizeof(int), 0644, proc_dointvec_minmax);
5263 set_table_entry(&table[6], 7, "forkexec_idx", &sd->forkexec_idx, 5280 set_table_entry(&table[6], "forkexec_idx", &sd->forkexec_idx,
5264 sizeof(int), 0644, proc_dointvec_minmax); 5281 sizeof(int), 0644, proc_dointvec_minmax);
5265 set_table_entry(&table[7], 8, "busy_factor", &sd->busy_factor, 5282 set_table_entry(&table[7], "busy_factor", &sd->busy_factor,
5266 sizeof(int), 0644, proc_dointvec_minmax); 5283 sizeof(int), 0644, proc_dointvec_minmax);
5267 set_table_entry(&table[8], 9, "imbalance_pct", &sd->imbalance_pct, 5284 set_table_entry(&table[8], "imbalance_pct", &sd->imbalance_pct,
5268 sizeof(int), 0644, proc_dointvec_minmax); 5285 sizeof(int), 0644, proc_dointvec_minmax);
5269 set_table_entry(&table[10], 11, "cache_nice_tries", 5286 set_table_entry(&table[10], "cache_nice_tries",
5270 &sd->cache_nice_tries, 5287 &sd->cache_nice_tries,
5271 sizeof(int), 0644, proc_dointvec_minmax); 5288 sizeof(int), 0644, proc_dointvec_minmax);
5272 set_table_entry(&table[12], 13, "flags", &sd->flags, 5289 set_table_entry(&table[12], "flags", &sd->flags,
5273 sizeof(int), 0644, proc_dointvec_minmax); 5290 sizeof(int), 0644, proc_dointvec_minmax);
5274 5291
5275 return table; 5292 return table;
@@ -5289,7 +5306,6 @@ static ctl_table *sd_alloc_ctl_cpu_table(int cpu)
5289 i = 0; 5306 i = 0;
5290 for_each_domain(cpu, sd) { 5307 for_each_domain(cpu, sd) {
5291 snprintf(buf, 32, "domain%d", i); 5308 snprintf(buf, 32, "domain%d", i);
5292 entry->ctl_name = i + 1;
5293 entry->procname = kstrdup(buf, GFP_KERNEL); 5309 entry->procname = kstrdup(buf, GFP_KERNEL);
5294 entry->mode = 0755; 5310 entry->mode = 0755;
5295 entry->child = sd_alloc_ctl_domain_table(sd); 5311 entry->child = sd_alloc_ctl_domain_table(sd);
@@ -5310,7 +5326,6 @@ static void init_sched_domain_sysctl(void)
5310 5326
5311 for (i = 0; i < cpu_num; i++, entry++) { 5327 for (i = 0; i < cpu_num; i++, entry++) {
5312 snprintf(buf, 32, "cpu%d", i); 5328 snprintf(buf, 32, "cpu%d", i);
5313 entry->ctl_name = i + 1;
5314 entry->procname = kstrdup(buf, GFP_KERNEL); 5329 entry->procname = kstrdup(buf, GFP_KERNEL);
5315 entry->mode = 0755; 5330 entry->mode = 0755;
5316 entry->child = sd_alloc_ctl_cpu_table(i); 5331 entry->child = sd_alloc_ctl_cpu_table(i);
@@ -5379,6 +5394,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
5379 rq->migration_thread = NULL; 5394 rq->migration_thread = NULL;
5380 /* Idle task back to normal (off runqueue, low prio) */ 5395 /* Idle task back to normal (off runqueue, low prio) */
5381 rq = task_rq_lock(rq->idle, &flags); 5396 rq = task_rq_lock(rq->idle, &flags);
5397 update_rq_clock(rq);
5382 deactivate_task(rq, rq->idle, 0); 5398 deactivate_task(rq, rq->idle, 0);
5383 rq->idle->static_prio = MAX_PRIO; 5399 rq->idle->static_prio = MAX_PRIO;
5384 __setscheduler(rq, rq->idle, SCHED_NORMAL, 0); 5400 __setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
@@ -6616,12 +6632,13 @@ void normalize_rt_tasks(void)
6616 goto out_unlock; 6632 goto out_unlock;
6617#endif 6633#endif
6618 6634
6635 update_rq_clock(rq);
6619 on_rq = p->se.on_rq; 6636 on_rq = p->se.on_rq;
6620 if (on_rq) 6637 if (on_rq)
6621 deactivate_task(task_rq(p), p, 0); 6638 deactivate_task(rq, p, 0);
6622 __setscheduler(rq, p, SCHED_NORMAL, 0); 6639 __setscheduler(rq, p, SCHED_NORMAL, 0);
6623 if (on_rq) { 6640 if (on_rq) {
6624 activate_task(task_rq(p), p, 0); 6641 activate_task(rq, p, 0);
6625 resched_task(rq->curr); 6642 resched_task(rq->curr);
6626 } 6643 }
6627#ifdef CONFIG_SMP 6644#ifdef CONFIG_SMP
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 8421b9399e10..3da32156394e 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -29,7 +29,7 @@
29 } while (0) 29 } while (0)
30 30
31static void 31static void
32print_task(struct seq_file *m, struct rq *rq, struct task_struct *p, u64 now) 32print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
33{ 33{
34 if (rq->curr == p) 34 if (rq->curr == p)
35 SEQ_printf(m, "R"); 35 SEQ_printf(m, "R");
@@ -56,7 +56,7 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p, u64 now)
56#endif 56#endif
57} 57}
58 58
59static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu, u64 now) 59static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
60{ 60{
61 struct task_struct *g, *p; 61 struct task_struct *g, *p;
62 62
@@ -77,7 +77,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu, u64 now)
77 if (!p->se.on_rq || task_cpu(p) != rq_cpu) 77 if (!p->se.on_rq || task_cpu(p) != rq_cpu)
78 continue; 78 continue;
79 79
80 print_task(m, rq, p, now); 80 print_task(m, rq, p);
81 } while_each_thread(g, p); 81 } while_each_thread(g, p);
82 82
83 read_unlock_irq(&tasklist_lock); 83 read_unlock_irq(&tasklist_lock);
@@ -106,7 +106,7 @@ print_cfs_rq_runtime_sum(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
106 (long long)wait_runtime_rq_sum); 106 (long long)wait_runtime_rq_sum);
107} 107}
108 108
109void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq, u64 now) 109void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
110{ 110{
111 SEQ_printf(m, "\ncfs_rq %p\n", cfs_rq); 111 SEQ_printf(m, "\ncfs_rq %p\n", cfs_rq);
112 112
@@ -124,7 +124,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq, u64 now)
124 print_cfs_rq_runtime_sum(m, cpu, cfs_rq); 124 print_cfs_rq_runtime_sum(m, cpu, cfs_rq);
125} 125}
126 126
127static void print_cpu(struct seq_file *m, int cpu, u64 now) 127static void print_cpu(struct seq_file *m, int cpu)
128{ 128{
129 struct rq *rq = &per_cpu(runqueues, cpu); 129 struct rq *rq = &per_cpu(runqueues, cpu);
130 130
@@ -166,9 +166,9 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
166 P(cpu_load[4]); 166 P(cpu_load[4]);
167#undef P 167#undef P
168 168
169 print_cfs_stats(m, cpu, now); 169 print_cfs_stats(m, cpu);
170 170
171 print_rq(m, rq, cpu, now); 171 print_rq(m, rq, cpu);
172} 172}
173 173
174static int sched_debug_show(struct seq_file *m, void *v) 174static int sched_debug_show(struct seq_file *m, void *v)
@@ -184,7 +184,7 @@ static int sched_debug_show(struct seq_file *m, void *v)
184 SEQ_printf(m, "now at %Lu nsecs\n", (unsigned long long)now); 184 SEQ_printf(m, "now at %Lu nsecs\n", (unsigned long long)now);
185 185
186 for_each_online_cpu(cpu) 186 for_each_online_cpu(cpu)
187 print_cpu(m, cpu, now); 187 print_cpu(m, cpu);
188 188
189 SEQ_printf(m, "\n"); 189 SEQ_printf(m, "\n");
190 190
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 6f579ff5a9bc..e91db32cadfd 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -222,21 +222,25 @@ niced_granularity(struct sched_entity *curr, unsigned long granularity)
222{ 222{
223 u64 tmp; 223 u64 tmp;
224 224
225 if (likely(curr->load.weight == NICE_0_LOAD))
226 return granularity;
225 /* 227 /*
226 * Negative nice levels get the same granularity as nice-0: 228 * Positive nice levels get the same granularity as nice-0:
227 */ 229 */
228 if (likely(curr->load.weight >= NICE_0_LOAD)) 230 if (likely(curr->load.weight < NICE_0_LOAD)) {
229 return granularity; 231 tmp = curr->load.weight * (u64)granularity;
232 return (long) (tmp >> NICE_0_SHIFT);
233 }
230 /* 234 /*
231 * Positive nice level tasks get linearly finer 235 * Negative nice level tasks get linearly finer
232 * granularity: 236 * granularity:
233 */ 237 */
234 tmp = curr->load.weight * (u64)granularity; 238 tmp = curr->load.inv_weight * (u64)granularity;
235 239
236 /* 240 /*
237 * It will always fit into 'long': 241 * It will always fit into 'long':
238 */ 242 */
239 return (long) (tmp >> NICE_0_SHIFT); 243 return (long) (tmp >> WMULT_SHIFT);
240} 244}
241 245
242static inline void 246static inline void
@@ -281,26 +285,25 @@ add_wait_runtime(struct cfs_rq *cfs_rq, struct sched_entity *se, long delta)
281 * are not in our scheduling class. 285 * are not in our scheduling class.
282 */ 286 */
283static inline void 287static inline void
284__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr, u64 now) 288__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr)
285{ 289{
286 unsigned long delta, delta_exec, delta_fair; 290 unsigned long delta, delta_exec, delta_fair, delta_mine;
287 long delta_mine;
288 struct load_weight *lw = &cfs_rq->load; 291 struct load_weight *lw = &cfs_rq->load;
289 unsigned long load = lw->weight; 292 unsigned long load = lw->weight;
290 293
291 if (unlikely(!load))
292 return;
293
294 delta_exec = curr->delta_exec; 294 delta_exec = curr->delta_exec;
295 schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max)); 295 schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));
296 296
297 curr->sum_exec_runtime += delta_exec; 297 curr->sum_exec_runtime += delta_exec;
298 cfs_rq->exec_clock += delta_exec; 298 cfs_rq->exec_clock += delta_exec;
299 299
300 if (unlikely(!load))
301 return;
302
300 delta_fair = calc_delta_fair(delta_exec, lw); 303 delta_fair = calc_delta_fair(delta_exec, lw);
301 delta_mine = calc_delta_mine(delta_exec, curr->load.weight, lw); 304 delta_mine = calc_delta_mine(delta_exec, curr->load.weight, lw);
302 305
303 if (cfs_rq->sleeper_bonus > sysctl_sched_stat_granularity) { 306 if (cfs_rq->sleeper_bonus > sysctl_sched_granularity) {
304 delta = calc_delta_mine(cfs_rq->sleeper_bonus, 307 delta = calc_delta_mine(cfs_rq->sleeper_bonus,
305 curr->load.weight, lw); 308 curr->load.weight, lw);
306 if (unlikely(delta > cfs_rq->sleeper_bonus)) 309 if (unlikely(delta > cfs_rq->sleeper_bonus))
@@ -321,7 +324,7 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr, u64 now)
321 add_wait_runtime(cfs_rq, curr, delta_mine - delta_exec); 324 add_wait_runtime(cfs_rq, curr, delta_mine - delta_exec);
322} 325}
323 326
324static void update_curr(struct cfs_rq *cfs_rq, u64 now) 327static void update_curr(struct cfs_rq *cfs_rq)
325{ 328{
326 struct sched_entity *curr = cfs_rq_curr(cfs_rq); 329 struct sched_entity *curr = cfs_rq_curr(cfs_rq);
327 unsigned long delta_exec; 330 unsigned long delta_exec;
@@ -334,22 +337,22 @@ static void update_curr(struct cfs_rq *cfs_rq, u64 now)
334 * since the last time we changed load (this cannot 337 * since the last time we changed load (this cannot
335 * overflow on 32 bits): 338 * overflow on 32 bits):
336 */ 339 */
337 delta_exec = (unsigned long)(now - curr->exec_start); 340 delta_exec = (unsigned long)(rq_of(cfs_rq)->clock - curr->exec_start);
338 341
339 curr->delta_exec += delta_exec; 342 curr->delta_exec += delta_exec;
340 343
341 if (unlikely(curr->delta_exec > sysctl_sched_stat_granularity)) { 344 if (unlikely(curr->delta_exec > sysctl_sched_stat_granularity)) {
342 __update_curr(cfs_rq, curr, now); 345 __update_curr(cfs_rq, curr);
343 curr->delta_exec = 0; 346 curr->delta_exec = 0;
344 } 347 }
345 curr->exec_start = now; 348 curr->exec_start = rq_of(cfs_rq)->clock;
346} 349}
347 350
348static inline void 351static inline void
349update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now) 352update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
350{ 353{
351 se->wait_start_fair = cfs_rq->fair_clock; 354 se->wait_start_fair = cfs_rq->fair_clock;
352 schedstat_set(se->wait_start, now); 355 schedstat_set(se->wait_start, rq_of(cfs_rq)->clock);
353} 356}
354 357
355/* 358/*
@@ -377,8 +380,7 @@ calc_weighted(unsigned long delta, unsigned long weight, int shift)
377/* 380/*
378 * Task is being enqueued - update stats: 381 * Task is being enqueued - update stats:
379 */ 382 */
380static void 383static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
381update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
382{ 384{
383 s64 key; 385 s64 key;
384 386
@@ -387,7 +389,7 @@ update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
387 * a dequeue/enqueue event is a NOP) 389 * a dequeue/enqueue event is a NOP)
388 */ 390 */
389 if (se != cfs_rq_curr(cfs_rq)) 391 if (se != cfs_rq_curr(cfs_rq))
390 update_stats_wait_start(cfs_rq, se, now); 392 update_stats_wait_start(cfs_rq, se);
391 /* 393 /*
392 * Update the key: 394 * Update the key:
393 */ 395 */
@@ -407,7 +409,8 @@ update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
407 (WMULT_SHIFT - NICE_0_SHIFT); 409 (WMULT_SHIFT - NICE_0_SHIFT);
408 } else { 410 } else {
409 tmp = se->wait_runtime; 411 tmp = se->wait_runtime;
410 key -= (tmp * se->load.weight) >> NICE_0_SHIFT; 412 key -= (tmp * se->load.inv_weight) >>
413 (WMULT_SHIFT - NICE_0_SHIFT);
411 } 414 }
412 } 415 }
413 416
@@ -418,11 +421,12 @@ update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
418 * Note: must be called with a freshly updated rq->fair_clock. 421 * Note: must be called with a freshly updated rq->fair_clock.
419 */ 422 */
420static inline void 423static inline void
421__update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now) 424__update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
422{ 425{
423 unsigned long delta_fair = se->delta_fair_run; 426 unsigned long delta_fair = se->delta_fair_run;
424 427
425 schedstat_set(se->wait_max, max(se->wait_max, now - se->wait_start)); 428 schedstat_set(se->wait_max, max(se->wait_max,
429 rq_of(cfs_rq)->clock - se->wait_start));
426 430
427 if (unlikely(se->load.weight != NICE_0_LOAD)) 431 if (unlikely(se->load.weight != NICE_0_LOAD))
428 delta_fair = calc_weighted(delta_fair, se->load.weight, 432 delta_fair = calc_weighted(delta_fair, se->load.weight,
@@ -432,7 +436,7 @@ __update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
432} 436}
433 437
434static void 438static void
435update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now) 439update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
436{ 440{
437 unsigned long delta_fair; 441 unsigned long delta_fair;
438 442
@@ -442,7 +446,7 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
442 se->delta_fair_run += delta_fair; 446 se->delta_fair_run += delta_fair;
443 if (unlikely(abs(se->delta_fair_run) >= 447 if (unlikely(abs(se->delta_fair_run) >=
444 sysctl_sched_stat_granularity)) { 448 sysctl_sched_stat_granularity)) {
445 __update_stats_wait_end(cfs_rq, se, now); 449 __update_stats_wait_end(cfs_rq, se);
446 se->delta_fair_run = 0; 450 se->delta_fair_run = 0;
447 } 451 }
448 452
@@ -451,34 +455,34 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
451} 455}
452 456
453static inline void 457static inline void
454update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now) 458update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
455{ 459{
456 update_curr(cfs_rq, now); 460 update_curr(cfs_rq);
457 /* 461 /*
458 * Mark the end of the wait period if dequeueing a 462 * Mark the end of the wait period if dequeueing a
459 * waiting task: 463 * waiting task:
460 */ 464 */
461 if (se != cfs_rq_curr(cfs_rq)) 465 if (se != cfs_rq_curr(cfs_rq))
462 update_stats_wait_end(cfs_rq, se, now); 466 update_stats_wait_end(cfs_rq, se);
463} 467}
464 468
465/* 469/*
466 * We are picking a new current task - update its stats: 470 * We are picking a new current task - update its stats:
467 */ 471 */
468static inline void 472static inline void
469update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now) 473update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
470{ 474{
471 /* 475 /*
472 * We are starting a new run period: 476 * We are starting a new run period:
473 */ 477 */
474 se->exec_start = now; 478 se->exec_start = rq_of(cfs_rq)->clock;
475} 479}
476 480
477/* 481/*
478 * We are descheduling a task - update its stats: 482 * We are descheduling a task - update its stats:
479 */ 483 */
480static inline void 484static inline void
481update_stats_curr_end(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now) 485update_stats_curr_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
482{ 486{
483 se->exec_start = 0; 487 se->exec_start = 0;
484} 488}
@@ -487,8 +491,7 @@ update_stats_curr_end(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
487 * Scheduling class queueing methods: 491 * Scheduling class queueing methods:
488 */ 492 */
489 493
490static void 494static void __enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
491__enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
492{ 495{
493 unsigned long load = cfs_rq->load.weight, delta_fair; 496 unsigned long load = cfs_rq->load.weight, delta_fair;
494 long prev_runtime; 497 long prev_runtime;
@@ -522,8 +525,7 @@ __enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
522 schedstat_add(cfs_rq, wait_runtime, se->wait_runtime); 525 schedstat_add(cfs_rq, wait_runtime, se->wait_runtime);
523} 526}
524 527
525static void 528static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
526enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
527{ 529{
528 struct task_struct *tsk = task_of(se); 530 struct task_struct *tsk = task_of(se);
529 unsigned long delta_fair; 531 unsigned long delta_fair;
@@ -538,7 +540,7 @@ enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
538 se->delta_fair_sleep += delta_fair; 540 se->delta_fair_sleep += delta_fair;
539 if (unlikely(abs(se->delta_fair_sleep) >= 541 if (unlikely(abs(se->delta_fair_sleep) >=
540 sysctl_sched_stat_granularity)) { 542 sysctl_sched_stat_granularity)) {
541 __enqueue_sleeper(cfs_rq, se, now); 543 __enqueue_sleeper(cfs_rq, se);
542 se->delta_fair_sleep = 0; 544 se->delta_fair_sleep = 0;
543 } 545 }
544 546
@@ -546,7 +548,7 @@ enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
546 548
547#ifdef CONFIG_SCHEDSTATS 549#ifdef CONFIG_SCHEDSTATS
548 if (se->sleep_start) { 550 if (se->sleep_start) {
549 u64 delta = now - se->sleep_start; 551 u64 delta = rq_of(cfs_rq)->clock - se->sleep_start;
550 552
551 if ((s64)delta < 0) 553 if ((s64)delta < 0)
552 delta = 0; 554 delta = 0;
@@ -558,7 +560,7 @@ enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
558 se->sum_sleep_runtime += delta; 560 se->sum_sleep_runtime += delta;
559 } 561 }
560 if (se->block_start) { 562 if (se->block_start) {
561 u64 delta = now - se->block_start; 563 u64 delta = rq_of(cfs_rq)->clock - se->block_start;
562 564
563 if ((s64)delta < 0) 565 if ((s64)delta < 0)
564 delta = 0; 566 delta = 0;
@@ -573,26 +575,24 @@ enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
573} 575}
574 576
575static void 577static void
576enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, 578enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
577 int wakeup, u64 now)
578{ 579{
579 /* 580 /*
580 * Update the fair clock. 581 * Update the fair clock.
581 */ 582 */
582 update_curr(cfs_rq, now); 583 update_curr(cfs_rq);
583 584
584 if (wakeup) 585 if (wakeup)
585 enqueue_sleeper(cfs_rq, se, now); 586 enqueue_sleeper(cfs_rq, se);
586 587
587 update_stats_enqueue(cfs_rq, se, now); 588 update_stats_enqueue(cfs_rq, se);
588 __enqueue_entity(cfs_rq, se); 589 __enqueue_entity(cfs_rq, se);
589} 590}
590 591
591static void 592static void
592dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, 593dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
593 int sleep, u64 now)
594{ 594{
595 update_stats_dequeue(cfs_rq, se, now); 595 update_stats_dequeue(cfs_rq, se);
596 if (sleep) { 596 if (sleep) {
597 se->sleep_start_fair = cfs_rq->fair_clock; 597 se->sleep_start_fair = cfs_rq->fair_clock;
598#ifdef CONFIG_SCHEDSTATS 598#ifdef CONFIG_SCHEDSTATS
@@ -600,9 +600,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
600 struct task_struct *tsk = task_of(se); 600 struct task_struct *tsk = task_of(se);
601 601
602 if (tsk->state & TASK_INTERRUPTIBLE) 602 if (tsk->state & TASK_INTERRUPTIBLE)
603 se->sleep_start = now; 603 se->sleep_start = rq_of(cfs_rq)->clock;
604 if (tsk->state & TASK_UNINTERRUPTIBLE) 604 if (tsk->state & TASK_UNINTERRUPTIBLE)
605 se->block_start = now; 605 se->block_start = rq_of(cfs_rq)->clock;
606 } 606 }
607 cfs_rq->wait_runtime -= se->wait_runtime; 607 cfs_rq->wait_runtime -= se->wait_runtime;
608#endif 608#endif
@@ -629,7 +629,7 @@ __check_preempt_curr_fair(struct cfs_rq *cfs_rq, struct sched_entity *se,
629} 629}
630 630
631static inline void 631static inline void
632set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now) 632set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
633{ 633{
634 /* 634 /*
635 * Any task has to be enqueued before it get to execute on 635 * Any task has to be enqueued before it get to execute on
@@ -638,49 +638,46 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
638 * done a put_prev_task_fair() shortly before this, which 638 * done a put_prev_task_fair() shortly before this, which
639 * updated rq->fair_clock - used by update_stats_wait_end()) 639 * updated rq->fair_clock - used by update_stats_wait_end())
640 */ 640 */
641 update_stats_wait_end(cfs_rq, se, now); 641 update_stats_wait_end(cfs_rq, se);
642 update_stats_curr_start(cfs_rq, se, now); 642 update_stats_curr_start(cfs_rq, se);
643 set_cfs_rq_curr(cfs_rq, se); 643 set_cfs_rq_curr(cfs_rq, se);
644} 644}
645 645
646static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq, u64 now) 646static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
647{ 647{
648 struct sched_entity *se = __pick_next_entity(cfs_rq); 648 struct sched_entity *se = __pick_next_entity(cfs_rq);
649 649
650 set_next_entity(cfs_rq, se, now); 650 set_next_entity(cfs_rq, se);
651 651
652 return se; 652 return se;
653} 653}
654 654
655static void 655static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
656put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev, u64 now)
657{ 656{
658 /* 657 /*
659 * If still on the runqueue then deactivate_task() 658 * If still on the runqueue then deactivate_task()
660 * was not called and update_curr() has to be done: 659 * was not called and update_curr() has to be done:
661 */ 660 */
662 if (prev->on_rq) 661 if (prev->on_rq)
663 update_curr(cfs_rq, now); 662 update_curr(cfs_rq);
664 663
665 update_stats_curr_end(cfs_rq, prev, now); 664 update_stats_curr_end(cfs_rq, prev);
666 665
667 if (prev->on_rq) 666 if (prev->on_rq)
668 update_stats_wait_start(cfs_rq, prev, now); 667 update_stats_wait_start(cfs_rq, prev);
669 set_cfs_rq_curr(cfs_rq, NULL); 668 set_cfs_rq_curr(cfs_rq, NULL);
670} 669}
671 670
672static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) 671static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
673{ 672{
674 struct rq *rq = rq_of(cfs_rq);
675 struct sched_entity *next; 673 struct sched_entity *next;
676 u64 now = __rq_clock(rq);
677 674
678 /* 675 /*
679 * Dequeue and enqueue the task to update its 676 * Dequeue and enqueue the task to update its
680 * position within the tree: 677 * position within the tree:
681 */ 678 */
682 dequeue_entity(cfs_rq, curr, 0, now); 679 dequeue_entity(cfs_rq, curr, 0);
683 enqueue_entity(cfs_rq, curr, 0, now); 680 enqueue_entity(cfs_rq, curr, 0);
684 681
685 /* 682 /*
686 * Reschedule if another task tops the current one. 683 * Reschedule if another task tops the current one.
@@ -785,8 +782,7 @@ static inline int is_same_group(struct task_struct *curr, struct task_struct *p)
785 * increased. Here we update the fair scheduling stats and 782 * increased. Here we update the fair scheduling stats and
786 * then put the task into the rbtree: 783 * then put the task into the rbtree:
787 */ 784 */
788static void 785static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
789enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup, u64 now)
790{ 786{
791 struct cfs_rq *cfs_rq; 787 struct cfs_rq *cfs_rq;
792 struct sched_entity *se = &p->se; 788 struct sched_entity *se = &p->se;
@@ -795,7 +791,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup, u64 now)
795 if (se->on_rq) 791 if (se->on_rq)
796 break; 792 break;
797 cfs_rq = cfs_rq_of(se); 793 cfs_rq = cfs_rq_of(se);
798 enqueue_entity(cfs_rq, se, wakeup, now); 794 enqueue_entity(cfs_rq, se, wakeup);
799 } 795 }
800} 796}
801 797
@@ -804,15 +800,14 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup, u64 now)
804 * decreased. We remove the task from the rbtree and 800 * decreased. We remove the task from the rbtree and
805 * update the fair scheduling stats: 801 * update the fair scheduling stats:
806 */ 802 */
807static void 803static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
808dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep, u64 now)
809{ 804{
810 struct cfs_rq *cfs_rq; 805 struct cfs_rq *cfs_rq;
811 struct sched_entity *se = &p->se; 806 struct sched_entity *se = &p->se;
812 807
813 for_each_sched_entity(se) { 808 for_each_sched_entity(se) {
814 cfs_rq = cfs_rq_of(se); 809 cfs_rq = cfs_rq_of(se);
815 dequeue_entity(cfs_rq, se, sleep, now); 810 dequeue_entity(cfs_rq, se, sleep);
816 /* Don't dequeue parent if it has other entities besides us */ 811 /* Don't dequeue parent if it has other entities besides us */
817 if (cfs_rq->load.weight) 812 if (cfs_rq->load.weight)
818 break; 813 break;
@@ -825,14 +820,14 @@ dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep, u64 now)
825static void yield_task_fair(struct rq *rq, struct task_struct *p) 820static void yield_task_fair(struct rq *rq, struct task_struct *p)
826{ 821{
827 struct cfs_rq *cfs_rq = task_cfs_rq(p); 822 struct cfs_rq *cfs_rq = task_cfs_rq(p);
828 u64 now = __rq_clock(rq);
829 823
824 __update_rq_clock(rq);
830 /* 825 /*
831 * Dequeue and enqueue the task to update its 826 * Dequeue and enqueue the task to update its
832 * position within the tree: 827 * position within the tree:
833 */ 828 */
834 dequeue_entity(cfs_rq, &p->se, 0, now); 829 dequeue_entity(cfs_rq, &p->se, 0);
835 enqueue_entity(cfs_rq, &p->se, 0, now); 830 enqueue_entity(cfs_rq, &p->se, 0);
836} 831}
837 832
838/* 833/*
@@ -845,7 +840,8 @@ static void check_preempt_curr_fair(struct rq *rq, struct task_struct *p)
845 unsigned long gran; 840 unsigned long gran;
846 841
847 if (unlikely(rt_prio(p->prio))) { 842 if (unlikely(rt_prio(p->prio))) {
848 update_curr(cfs_rq, rq_clock(rq)); 843 update_rq_clock(rq);
844 update_curr(cfs_rq);
849 resched_task(curr); 845 resched_task(curr);
850 return; 846 return;
851 } 847 }
@@ -861,7 +857,7 @@ static void check_preempt_curr_fair(struct rq *rq, struct task_struct *p)
861 __check_preempt_curr_fair(cfs_rq, &p->se, &curr->se, gran); 857 __check_preempt_curr_fair(cfs_rq, &p->se, &curr->se, gran);
862} 858}
863 859
864static struct task_struct *pick_next_task_fair(struct rq *rq, u64 now) 860static struct task_struct *pick_next_task_fair(struct rq *rq)
865{ 861{
866 struct cfs_rq *cfs_rq = &rq->cfs; 862 struct cfs_rq *cfs_rq = &rq->cfs;
867 struct sched_entity *se; 863 struct sched_entity *se;
@@ -870,7 +866,7 @@ static struct task_struct *pick_next_task_fair(struct rq *rq, u64 now)
870 return NULL; 866 return NULL;
871 867
872 do { 868 do {
873 se = pick_next_entity(cfs_rq, now); 869 se = pick_next_entity(cfs_rq);
874 cfs_rq = group_cfs_rq(se); 870 cfs_rq = group_cfs_rq(se);
875 } while (cfs_rq); 871 } while (cfs_rq);
876 872
@@ -880,14 +876,14 @@ static struct task_struct *pick_next_task_fair(struct rq *rq, u64 now)
880/* 876/*
881 * Account for a descheduled task: 877 * Account for a descheduled task:
882 */ 878 */
883static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, u64 now) 879static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
884{ 880{
885 struct sched_entity *se = &prev->se; 881 struct sched_entity *se = &prev->se;
886 struct cfs_rq *cfs_rq; 882 struct cfs_rq *cfs_rq;
887 883
888 for_each_sched_entity(se) { 884 for_each_sched_entity(se) {
889 cfs_rq = cfs_rq_of(se); 885 cfs_rq = cfs_rq_of(se);
890 put_prev_entity(cfs_rq, se, now); 886 put_prev_entity(cfs_rq, se);
891 } 887 }
892} 888}
893 889
@@ -930,6 +926,7 @@ static struct task_struct *load_balance_next_fair(void *arg)
930 return __load_balance_iterator(cfs_rq, cfs_rq->rb_load_balance_curr); 926 return __load_balance_iterator(cfs_rq, cfs_rq->rb_load_balance_curr);
931} 927}
932 928
929#ifdef CONFIG_FAIR_GROUP_SCHED
933static int cfs_rq_best_prio(struct cfs_rq *cfs_rq) 930static int cfs_rq_best_prio(struct cfs_rq *cfs_rq)
934{ 931{
935 struct sched_entity *curr; 932 struct sched_entity *curr;
@@ -943,12 +940,13 @@ static int cfs_rq_best_prio(struct cfs_rq *cfs_rq)
943 940
944 return p->prio; 941 return p->prio;
945} 942}
943#endif
946 944
947static int 945static unsigned long
948load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest, 946load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
949 unsigned long max_nr_move, unsigned long max_load_move, 947 unsigned long max_nr_move, unsigned long max_load_move,
950 struct sched_domain *sd, enum cpu_idle_type idle, 948 struct sched_domain *sd, enum cpu_idle_type idle,
951 int *all_pinned, unsigned long *total_load_moved) 949 int *all_pinned, int *this_best_prio)
952{ 950{
953 struct cfs_rq *busy_cfs_rq; 951 struct cfs_rq *busy_cfs_rq;
954 unsigned long load_moved, total_nr_moved = 0, nr_moved; 952 unsigned long load_moved, total_nr_moved = 0, nr_moved;
@@ -959,10 +957,10 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
959 cfs_rq_iterator.next = load_balance_next_fair; 957 cfs_rq_iterator.next = load_balance_next_fair;
960 958
961 for_each_leaf_cfs_rq(busiest, busy_cfs_rq) { 959 for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {
960#ifdef CONFIG_FAIR_GROUP_SCHED
962 struct cfs_rq *this_cfs_rq; 961 struct cfs_rq *this_cfs_rq;
 963 long imbalance; 962 long imbalance;
964 unsigned long maxload; 963 unsigned long maxload;
965 int this_best_prio, best_prio, best_prio_seen = 0;
966 964
967 this_cfs_rq = cpu_cfs_rq(busy_cfs_rq, this_cpu); 965 this_cfs_rq = cpu_cfs_rq(busy_cfs_rq, this_cpu);
968 966
@@ -976,27 +974,17 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
976 imbalance /= 2; 974 imbalance /= 2;
977 maxload = min(rem_load_move, imbalance); 975 maxload = min(rem_load_move, imbalance);
978 976
979 this_best_prio = cfs_rq_best_prio(this_cfs_rq); 977 *this_best_prio = cfs_rq_best_prio(this_cfs_rq);
980 best_prio = cfs_rq_best_prio(busy_cfs_rq); 978#else
981 979#define maxload rem_load_move
982 /* 980#endif
983 * Enable handling of the case where there is more than one task
984 * with the best priority. If the current running task is one
985 * of those with prio==best_prio we know it won't be moved
986 * and therefore it's safe to override the skip (based on load)
987 * of any task we find with that prio.
988 */
989 if (cfs_rq_curr(busy_cfs_rq) == &busiest->curr->se)
990 best_prio_seen = 1;
991
992 /* pass busy_cfs_rq argument into 981 /* pass busy_cfs_rq argument into
993 * load_balance_[start|next]_fair iterators 982 * load_balance_[start|next]_fair iterators
994 */ 983 */
995 cfs_rq_iterator.arg = busy_cfs_rq; 984 cfs_rq_iterator.arg = busy_cfs_rq;
996 nr_moved = balance_tasks(this_rq, this_cpu, busiest, 985 nr_moved = balance_tasks(this_rq, this_cpu, busiest,
997 max_nr_move, maxload, sd, idle, all_pinned, 986 max_nr_move, maxload, sd, idle, all_pinned,
998 &load_moved, this_best_prio, best_prio, 987 &load_moved, this_best_prio, &cfs_rq_iterator);
999 best_prio_seen, &cfs_rq_iterator);
1000 988
1001 total_nr_moved += nr_moved; 989 total_nr_moved += nr_moved;
1002 max_nr_move -= nr_moved; 990 max_nr_move -= nr_moved;
@@ -1006,9 +994,7 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
1006 break; 994 break;
1007 } 995 }
1008 996
1009 *total_load_moved = max_load_move - rem_load_move; 997 return max_load_move - rem_load_move;
1010
1011 return total_nr_moved;
1012} 998}
1013 999
1014/* 1000/*
@@ -1032,14 +1018,14 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr)
1032 * monopolize the CPU. Note: the parent runqueue is locked, 1018 * monopolize the CPU. Note: the parent runqueue is locked,
1033 * the child is not running yet. 1019 * the child is not running yet.
1034 */ 1020 */
1035static void task_new_fair(struct rq *rq, struct task_struct *p, u64 now) 1021static void task_new_fair(struct rq *rq, struct task_struct *p)
1036{ 1022{
1037 struct cfs_rq *cfs_rq = task_cfs_rq(p); 1023 struct cfs_rq *cfs_rq = task_cfs_rq(p);
1038 struct sched_entity *se = &p->se; 1024 struct sched_entity *se = &p->se;
1039 1025
1040 sched_info_queued(p); 1026 sched_info_queued(p);
1041 1027
1042 update_stats_enqueue(cfs_rq, se, now); 1028 update_stats_enqueue(cfs_rq, se);
1043 /* 1029 /*
1044 * Child runs first: we let it run before the parent 1030 * Child runs first: we let it run before the parent
1045 * until it reschedules once. We set up the key so that 1031 * until it reschedules once. We set up the key so that
@@ -1072,15 +1058,10 @@ static void task_new_fair(struct rq *rq, struct task_struct *p, u64 now)
1072 */ 1058 */
1073static void set_curr_task_fair(struct rq *rq) 1059static void set_curr_task_fair(struct rq *rq)
1074{ 1060{
1075 struct task_struct *curr = rq->curr; 1061 struct sched_entity *se = &rq->curr->se;
1076 struct sched_entity *se = &curr->se;
1077 u64 now = rq_clock(rq);
1078 struct cfs_rq *cfs_rq;
1079 1062
1080 for_each_sched_entity(se) { 1063 for_each_sched_entity(se)
1081 cfs_rq = cfs_rq_of(se); 1064 set_next_entity(cfs_rq_of(se), se);
1082 set_next_entity(cfs_rq, se, now);
1083 }
1084} 1065}
1085#else 1066#else
1086static void set_curr_task_fair(struct rq *rq) 1067static void set_curr_task_fair(struct rq *rq)
@@ -1109,12 +1090,11 @@ struct sched_class fair_sched_class __read_mostly = {
1109}; 1090};
1110 1091
1111#ifdef CONFIG_SCHED_DEBUG 1092#ifdef CONFIG_SCHED_DEBUG
1112void print_cfs_stats(struct seq_file *m, int cpu, u64 now) 1093static void print_cfs_stats(struct seq_file *m, int cpu)
1113{ 1094{
1114 struct rq *rq = cpu_rq(cpu);
1115 struct cfs_rq *cfs_rq; 1095 struct cfs_rq *cfs_rq;
1116 1096
1117 for_each_leaf_cfs_rq(rq, cfs_rq) 1097 for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
1118 print_cfs_rq(m, cpu, cfs_rq, now); 1098 print_cfs_rq(m, cpu, cfs_rq);
1119} 1099}
1120#endif 1100#endif
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 41841e741c4a..3503fb2d9f96 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -13,7 +13,7 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p)
13 resched_task(rq->idle); 13 resched_task(rq->idle);
14} 14}
15 15
16static struct task_struct *pick_next_task_idle(struct rq *rq, u64 now) 16static struct task_struct *pick_next_task_idle(struct rq *rq)
17{ 17{
18 schedstat_inc(rq, sched_goidle); 18 schedstat_inc(rq, sched_goidle);
19 19
@@ -25,7 +25,7 @@ static struct task_struct *pick_next_task_idle(struct rq *rq, u64 now)
25 * message if some code attempts to do it: 25 * message if some code attempts to do it:
26 */ 26 */
27static void 27static void
28dequeue_task_idle(struct rq *rq, struct task_struct *p, int sleep, u64 now) 28dequeue_task_idle(struct rq *rq, struct task_struct *p, int sleep)
29{ 29{
30 spin_unlock_irq(&rq->lock); 30 spin_unlock_irq(&rq->lock);
31 printk(KERN_ERR "bad: scheduling from the idle thread!\n"); 31 printk(KERN_ERR "bad: scheduling from the idle thread!\n");
@@ -33,15 +33,15 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int sleep, u64 now)
33 spin_lock_irq(&rq->lock); 33 spin_lock_irq(&rq->lock);
34} 34}
35 35
36static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, u64 now) 36static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
37{ 37{
38} 38}
39 39
40static int 40static unsigned long
41load_balance_idle(struct rq *this_rq, int this_cpu, struct rq *busiest, 41load_balance_idle(struct rq *this_rq, int this_cpu, struct rq *busiest,
42 unsigned long max_nr_move, unsigned long max_load_move, 42 unsigned long max_nr_move, unsigned long max_load_move,
43 struct sched_domain *sd, enum cpu_idle_type idle, 43 struct sched_domain *sd, enum cpu_idle_type idle,
44 int *all_pinned, unsigned long *total_load_moved) 44 int *all_pinned, int *this_best_prio)
45{ 45{
46 return 0; 46 return 0;
47} 47}
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 002fcf8d3f64..dcdcad632fd9 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -7,7 +7,7 @@
7 * Update the current task's runtime statistics. Skip current tasks that 7 * Update the current task's runtime statistics. Skip current tasks that
8 * are not in our scheduling class. 8 * are not in our scheduling class.
9 */ 9 */
10static inline void update_curr_rt(struct rq *rq, u64 now) 10static inline void update_curr_rt(struct rq *rq)
11{ 11{
12 struct task_struct *curr = rq->curr; 12 struct task_struct *curr = rq->curr;
13 u64 delta_exec; 13 u64 delta_exec;
@@ -15,18 +15,17 @@ static inline void update_curr_rt(struct rq *rq, u64 now)
15 if (!task_has_rt_policy(curr)) 15 if (!task_has_rt_policy(curr))
16 return; 16 return;
17 17
18 delta_exec = now - curr->se.exec_start; 18 delta_exec = rq->clock - curr->se.exec_start;
19 if (unlikely((s64)delta_exec < 0)) 19 if (unlikely((s64)delta_exec < 0))
20 delta_exec = 0; 20 delta_exec = 0;
21 21
22 schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec)); 22 schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec));
23 23
24 curr->se.sum_exec_runtime += delta_exec; 24 curr->se.sum_exec_runtime += delta_exec;
25 curr->se.exec_start = now; 25 curr->se.exec_start = rq->clock;
26} 26}
27 27
28static void 28static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
29enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup, u64 now)
30{ 29{
31 struct rt_prio_array *array = &rq->rt.active; 30 struct rt_prio_array *array = &rq->rt.active;
32 31
@@ -37,12 +36,11 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup, u64 now)
37/* 36/*
38 * Adding/removing a task to/from a priority array: 37 * Adding/removing a task to/from a priority array:
39 */ 38 */
40static void 39static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
41dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep, u64 now)
42{ 40{
43 struct rt_prio_array *array = &rq->rt.active; 41 struct rt_prio_array *array = &rq->rt.active;
44 42
45 update_curr_rt(rq, now); 43 update_curr_rt(rq);
46 44
47 list_del(&p->run_list); 45 list_del(&p->run_list);
48 if (list_empty(array->queue + p->prio)) 46 if (list_empty(array->queue + p->prio))
@@ -75,7 +73,7 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p)
75 resched_task(rq->curr); 73 resched_task(rq->curr);
76} 74}
77 75
78static struct task_struct *pick_next_task_rt(struct rq *rq, u64 now) 76static struct task_struct *pick_next_task_rt(struct rq *rq)
79{ 77{
80 struct rt_prio_array *array = &rq->rt.active; 78 struct rt_prio_array *array = &rq->rt.active;
81 struct task_struct *next; 79 struct task_struct *next;
@@ -89,14 +87,14 @@ static struct task_struct *pick_next_task_rt(struct rq *rq, u64 now)
89 queue = array->queue + idx; 87 queue = array->queue + idx;
90 next = list_entry(queue->next, struct task_struct, run_list); 88 next = list_entry(queue->next, struct task_struct, run_list);
91 89
92 next->se.exec_start = now; 90 next->se.exec_start = rq->clock;
93 91
94 return next; 92 return next;
95} 93}
96 94
97static void put_prev_task_rt(struct rq *rq, struct task_struct *p, u64 now) 95static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
98{ 96{
99 update_curr_rt(rq, now); 97 update_curr_rt(rq);
100 p->se.exec_start = 0; 98 p->se.exec_start = 0;
101} 99}
102 100
@@ -172,28 +170,15 @@ static struct task_struct *load_balance_next_rt(void *arg)
172 return p; 170 return p;
173} 171}
174 172
175static int 173static unsigned long
176load_balance_rt(struct rq *this_rq, int this_cpu, struct rq *busiest, 174load_balance_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
177 unsigned long max_nr_move, unsigned long max_load_move, 175 unsigned long max_nr_move, unsigned long max_load_move,
178 struct sched_domain *sd, enum cpu_idle_type idle, 176 struct sched_domain *sd, enum cpu_idle_type idle,
179 int *all_pinned, unsigned long *load_moved) 177 int *all_pinned, int *this_best_prio)
180{ 178{
181 int this_best_prio, best_prio, best_prio_seen = 0;
182 int nr_moved; 179 int nr_moved;
183 struct rq_iterator rt_rq_iterator; 180 struct rq_iterator rt_rq_iterator;
184 181 unsigned long load_moved;
185 best_prio = sched_find_first_bit(busiest->rt.active.bitmap);
186 this_best_prio = sched_find_first_bit(this_rq->rt.active.bitmap);
187
188 /*
189 * Enable handling of the case where there is more than one task
190 * with the best priority. If the current running task is one
191 * of those with prio==best_prio we know it won't be moved
192 * and therefore it's safe to override the skip (based on load)
193 * of any task we find with that prio.
194 */
195 if (busiest->curr->prio == best_prio)
196 best_prio_seen = 1;
197 182
198 rt_rq_iterator.start = load_balance_start_rt; 183 rt_rq_iterator.start = load_balance_start_rt;
199 rt_rq_iterator.next = load_balance_next_rt; 184 rt_rq_iterator.next = load_balance_next_rt;
@@ -203,11 +188,10 @@ load_balance_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
203 rt_rq_iterator.arg = busiest; 188 rt_rq_iterator.arg = busiest;
204 189
205 nr_moved = balance_tasks(this_rq, this_cpu, busiest, max_nr_move, 190 nr_moved = balance_tasks(this_rq, this_cpu, busiest, max_nr_move,
206 max_load_move, sd, idle, all_pinned, load_moved, 191 max_load_move, sd, idle, all_pinned, &load_moved,
207 this_best_prio, best_prio, best_prio_seen, 192 this_best_prio, &rt_rq_iterator);
208 &rt_rq_iterator);
209 193
210 return nr_moved; 194 return load_moved;
211} 195}
212 196
213static void task_tick_rt(struct rq *rq, struct task_struct *p) 197static void task_tick_rt(struct rq *rq, struct task_struct *p)