sched, cputime: Introduce thread_group_times()

This is a real fix for the problem of decreasing utime/stime values
described in the thread:

   http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

 - {u,s}time in task_struct are increased every time the thread
   is interrupted by a tick (timer interrupt).

 - When a thread exits, its {u,s}time are added to signal->{u,s}time,
   after being adjusted by task_times().

 - When all threads in a thread_group exit, the accumulated {u,s}time
   (and also c{u,s}time) in the signal struct are added to c{u,s}time
   in the signal struct of the group's parent.

So {u,s}time in task_struct are "raw" tick counts, while
{u,s}time and c{u,s}time in the signal struct are "adjusted" values.

The accounted values are used by:

 - task_times(), to get the cputime of a thread:
   This function returns adjusted values that originate from the raw
   {u,s}time and are scaled by sum_exec_runtime, which is accounted
   by CFS.

 - thread_group_cputime(), to get the cputime of a thread group:
   This function returns the sum of {u,s}time of all living threads in
   the group, plus {u,s}time in the signal struct, which is the sum of
   the adjusted cputimes of all exited threads that belonged to the
   group.

The problem is the return value of thread_group_cputime(),
because it is a mixed sum of "raw" and "adjusted" values:

  group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

This misbehavior can break {u,s}time monotonicity.
Assume a thread whose raw values are greater than its adjusted
values (e.g. interrupted by 1000Hz ticks 50 times but only ran for
45ms). When it exits, the group's cputime decreases (e.g. by 5ms).

To fix this, we could do:

  group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

But task_times() contains hard divisions, so applying it to
every thread should be avoided.

This patch fixes the above problem in the following way:

 - Modify thread exit (= __exit_signal()) not to use task_times().
   This means {u,s}time in the signal struct accumulates raw values
   instead of adjusted values. As a result, thread_group_cputime()
   returns a pure sum of "raw" values.

 - Introduce a new function thread_group_times(*task, *utime, *stime)
   that converts the "raw" values of thread_group_cputime() to
   "adjusted" values, using the same calculation procedure as
   task_times().

 - Modify group exit (= wait_task_zombie()) to use the newly
   introduced thread_group_times(). This makes c{u,s}time in the
   signal struct hold adjusted values, as before this patch.

 - Replace some thread_group_cputime() calls with thread_group_times().
   These replacements are applied only where the "adjusted" cputime is
   conveyed to users and where task_times() is already used nearby
   (i.e. sys_times(), getrusage(), and /proc/<PID>/stat).

This patch has a positive side effect:

 - Before this patch, if a group contained many short-lived threads
   (e.g. each runs 0.9ms and is not interrupted by ticks), the group's
   cputime could be invisible, since each thread's cputime was
   accumulated after being adjusted: imagine the adjustment function
   as adj(ticks, runtime),
     {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
   After this patch this will not happen, because the adjustment is
   applied after accumulation.

v2:
 - remove if()s, put new variables into signal_struct.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
LKML-Reference: <4B162517.8040909@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
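
The two arithmetic examples in the message (the 50-tick/45ms thread and the 0.9ms threads) can be checked with a small userspace model. This is only a sketch, not kernel code: struct thread_model and all three functions are hypothetical names, and the adjustment is simplified to "runtime truncated to whole 1ms ticks", which assumes the adjusted utime and stime of a thread always sum to its runtime.

```c
#include <stddef.h>

/* Hypothetical model of one thread; not a kernel structure. */
struct thread_model {
    unsigned long raw_ticks;   /* ticks charged to the thread (1 tick = 1ms at 1000Hz) */
    unsigned long runtime_us;  /* precise runtime tracked by the scheduler, in microseconds */
    int exited;                /* nonzero once the thread has exited */
};

/* Assumed adjustment: the adjusted total is the runtime truncated to whole ticks. */
static unsigned long adjusted_ticks(unsigned long runtime_us)
{
    return runtime_us / 1000;
}

/* Before the patch: raw values of living threads plus adjusted values of exited ones. */
static unsigned long group_total_before(const struct thread_model *t, size_t n)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += t[i].exited ? adjusted_ticks(t[i].runtime_us) : t[i].raw_ticks;
    return sum;
}

/* After the patch: accumulate first, then adjust once for the whole group. */
static unsigned long group_total_after(const struct thread_model *t, size_t n)
{
    unsigned long runtime_us = 0;

    for (size_t i = 0; i < n; i++)
        runtime_us += t[i].runtime_us;
    return adjusted_ticks(runtime_us);
}
```

For a single thread charged 50 ticks over 45000us of runtime, group_total_before() gives 50 while the thread is alive and 45 once it has exited, whereas group_total_after() gives 45 in both cases. For ten exited threads of 900us each, the model gives 0 before the patch and 9 after it.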
author: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> 2009-12-02 03:28:07 -0500
committer: Ingo Molnar <mingo@elte.hu> 2009-12-02 11:32:40 -0500
commit: 0cf55e1ec08bb5a22e068309e2d8ba1180ab4239 (patch)
tree: 6102662a9594d51155bee11666fe8517fcbe6039 /kernel/sched.c
parent: d99ca3b977fc5a93141304f571475c2af9e6c1c5 (diff)
1 files changed, 41 insertions, 0 deletions
diff --git a/kernel/sched.c b/kernel/sched.c
index 17e2c1db2bde..e6ba726941ae 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5187,6 +5187,16 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
        *ut = p->utime;
        *st = p->stime;
 }
+
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+        struct task_cputime cputime;
+
+        thread_group_cputime(p, &cputime);
+
+        *ut = cputime.utime;
+        *st = cputime.stime;
+}
 #else
 
 #ifndef nsecs_to_cputime
@@ -5220,6 +5230,37 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
        *ut = p->prev_utime;
        *st = p->prev_stime;
 }
+
+/*
+ * Must be called with siglock held.
+ */
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+        struct signal_struct *sig = p->signal;
+        struct task_cputime cputime;
+        cputime_t rtime, utime, total;
+
+        thread_group_cputime(p, &cputime);
+
+        total = cputime_add(cputime.utime, cputime.stime);
+        rtime = nsecs_to_cputime(cputime.sum_exec_runtime);
+
+        if (total) {
+                u64 temp;
+
+                temp = (u64)(rtime * cputime.utime);
+                do_div(temp, total);
+                utime = (cputime_t)temp;
+        } else
+                utime = rtime;
+
+        sig->prev_utime = max(sig->prev_utime, utime);
+        sig->prev_stime = max(sig->prev_stime,
+                              cputime_sub(rtime, sig->prev_utime));
+
+        *ut = sig->prev_utime;
+        *st = sig->prev_stime;
+}
 #endif
 
 /*
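
The scaling and clamping done by the second thread_group_times() variant can be tried outside the kernel with plain 64-bit integers. The following is a rough sketch under stated assumptions, not the kernel implementation: struct group_prev, max_u64() and group_times_sketch() are hypothetical stand-ins, cputime_t is replaced by uint64_t, and the raw group totals are passed in directly instead of coming from thread_group_cputime(). The subtraction is guarded against underflow here, which the kernel code does not need as long as the runtime never goes backwards.

```c
#include <stdint.h>

/* Hypothetical stand-in for the prev_utime/prev_stime fields kept per thread group. */
struct group_prev {
    uint64_t prev_utime;
    uint64_t prev_stime;
};

static uint64_t max_u64(uint64_t a, uint64_t b)
{
    return a > b ? a : b;
}

/*
 * Split the precise runtime (rtime) into user and system time in the same
 * proportion as the raw tick counts, and never report less than before.
 */
static void group_times_sketch(struct group_prev *sig,
                               uint64_t raw_utime, uint64_t raw_stime,
                               uint64_t rtime, uint64_t *ut, uint64_t *st)
{
    uint64_t total = raw_utime + raw_stime;
    uint64_t utime, rest;

    if (total)
        utime = rtime * raw_utime / total;  /* scale utime by rtime / total */
    else
        utime = rtime;                      /* no ticks seen: charge everything to utime */

    sig->prev_utime = max_u64(sig->prev_utime, utime);

    /* Guarded subtraction (sketch only): whatever is not user time is system time. */
    rest = rtime > sig->prev_utime ? rtime - sig->prev_utime : 0;
    sig->prev_stime = max_u64(sig->prev_stime, rest);

    *ut = sig->prev_utime;
    *st = sig->prev_stime;
}
```

Starting from a zeroed struct group_prev, raw values of 30 and 20 ticks with a runtime of 45 yield utime 27 and stime 18. A later call with raw values of 30 and 30 and a runtime of 48 would scale utime down to 24, but the stored maximum keeps it at 27 while stime grows to 21, so neither value reported to the caller decreases.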