diff options
author | Ingo Molnar <mingo@elte.hu> | 2008-05-08 05:53:48 -0400 |
---|---|---|
committer | Ingo Molnar <mingo@elte.hu> | 2008-05-08 11:00:42 -0400 |
commit | bf726eab3711cf192405d21688a4b21e07b6188a (patch) | |
tree | 5450be096ade3a66edee8858fd492ff55e20e4f8 /kernel | |
parent | 3de2403e6659d71b36ec820dc9b942762ddfe6eb (diff) |
semaphore: fix
Yanmin Zhang reported:
| Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more th
| regression under 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton,
| and Itanium Montecito. Bisect located the patch below:
|
| 64ac24e738823161693bf791f87adc802cf529ff is first bad commit
| commit 64ac24e738823161693bf791f87adc802cf529ff
| Author: Matthew Wilcox <matthew@wil.cx>
| Date: Fri Mar 7 21:55:58 2008 -0500
|
| Generic semaphore implementation
|
| After I manually reverted the patch against 2.6.26-rc1 while fixing
| lots of conflicts/errors, aim7 regression became less than 2%.
i reproduced the AIM7 workload and can confirm Yanmin's findings that
-.26-rc1 regresses over .25 - by over 67% here.
Looking at the workload i found and fixed what i believe to be the real
bug causing the AIM7 regression: it was inefficient wakeup / scheduling
/ locking behavior of the new generic semaphore code, causing suboptimal
performance.
The problem comes from the following code. The new semaphore code does
this on down():
spin_lock_irqsave(&sem->lock, flags);
if (likely(sem->count > 0))
sem->count--;
else
__down(sem);
spin_unlock_irqrestore(&sem->lock, flags);
and this on up():
spin_lock_irqsave(&sem->lock, flags);
if (likely(list_empty(&sem->wait_list)))
sem->count++;
else
__up(sem);
spin_unlock_irqrestore(&sem->lock, flags);
where __up() does:
list_del(&waiter->list);
waiter->up = 1;
wake_up_process(waiter->task);
and where __down() does this in essence:
list_add_tail(&waiter.list, &sem->wait_list);
waiter.task = task;
waiter.up = 0;
for (;;) {
[...]
spin_unlock_irq(&sem->lock);
timeout = schedule_timeout(timeout);
spin_lock_irq(&sem->lock);
if (waiter.up)
return 0;
}
the fastpath looks good and obvious, but note the following property of
the contended path: if there's a task on the ->wait_list, the up() of
the current owner will "pass over" ownership to that waiting task, in a
wake-one manner, via the waiter->up flag and by removing the waiter from
the wait list.
That is all and fine in principle, but as implemented in
kernel/semaphore.c it also creates a nasty, hidden source of contention!
The contention comes from the following property of the new semaphore
code: the new owner owns the semaphore exclusively, even if it is not
running yet.
So if the old owner, even if just a few instructions later, does a
down() [lock_kernel()] again, it will be blocked and will have to wait
on the new owner to eventually be scheduled (possibly on another CPU)!
Or if another task gets to lock_kernel() sooner than the "new owner"
scheduled, it will be blocked unnecessarily and for a very long time
when there are 2000 tasks running.
I.e. the implementation of the new semaphores code does wake-one and
lock ownership in a very restrictive way - it does not allow
opportunistic re-locking of the lock at all and keeps the scheduler from
picking task order intelligently.
This kind of scheduling, with 2000 AIM7 processes running, creates awful
cross-scheduling between those 2000 tasks, causes reduced parallelism, a
throttled runqueue length and a lot of idle time. With increasing number
of CPUs it causes an exponentially worse behavior in AIM7, as the chance
for a newly woken new-owner task to actually run anytime soon is less
and less likely.
Note that it takes just a tiny bit of contention for the 'new-semaphore
catastrophy' to happen: the wakeup latencies get added to whatever small
contention there is, and quickly snowball out of control!
I believe Yanmin's findings and numbers support this analysis too.
The best fix for this problem is to use the same scheduling logic that
the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and
wanted because we do not want to over-schedule), but also allow
opportunistic locking of the lock even if a wakee is already "in
flight".
The patch below implements this new logic. With this patch applied the
AIM7 regression is largely fixed on my quad testbox:
# v2.6.25 vanilla:
..................
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 56096.4 91 207.5 789.7 0.4675
2000 55894.4 94 208.2 792.7 0.4658
# v2.6.26-rc1-166-gc0a1811 vanilla:
...................................
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 33230.6 83 350.3 784.5 0.2769
2000 31778.1 86 366.3 783.6 0.2648
# v2.6.26-rc1-166-gc0a1811 + semaphore-speedup:
...............................................
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 55707.1 92 209.0 795.6 0.4642
2000 55704.4 96 209.0 796.0 0.4642
i.e. a 67% speedup. We are now back to within 1% of the v2.6.25
performance levels and have zero idle time during the test, as expected.
Btw., interactivity also improved dramatically with the fix - for
example console-switching became almost instantaneous during this
workload (which after all is running 2000 tasks at once!), without the
patch it was stuck for a minute at times.
There's another nice side-effect of this speedup patch, the new generic
semaphore code got even smaller:
text data bss dec hex filename
1241 0 0 1241 4d9 semaphore.o.before
1207 0 0 1207 4b7 semaphore.o.after
(because the waiter.up complication got removed.)
Longer-term we should look into using the mutex code for the generic
semaphore code as well - but i's not easy due to legacies and it's
outside of the scope of v2.6.26 and outside the scope of this patch as
well.
Bisected-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Diffstat (limited to 'kernel')
-rw-r--r-- | kernel/semaphore.c | 64 |
1 files changed, 30 insertions, 34 deletions
diff --git a/kernel/semaphore.c b/kernel/semaphore.c index 5c2942e768cd..5e41217239e8 100644 --- a/kernel/semaphore.c +++ b/kernel/semaphore.c | |||
@@ -54,10 +54,9 @@ void down(struct semaphore *sem) | |||
54 | unsigned long flags; | 54 | unsigned long flags; |
55 | 55 | ||
56 | spin_lock_irqsave(&sem->lock, flags); | 56 | spin_lock_irqsave(&sem->lock, flags); |
57 | if (likely(sem->count > 0)) | 57 | if (unlikely(!sem->count)) |
58 | sem->count--; | ||
59 | else | ||
60 | __down(sem); | 58 | __down(sem); |
59 | sem->count--; | ||
61 | spin_unlock_irqrestore(&sem->lock, flags); | 60 | spin_unlock_irqrestore(&sem->lock, flags); |
62 | } | 61 | } |
63 | EXPORT_SYMBOL(down); | 62 | EXPORT_SYMBOL(down); |
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore *sem) | |||
77 | int result = 0; | 76 | int result = 0; |
78 | 77 | ||
79 | spin_lock_irqsave(&sem->lock, flags); | 78 | spin_lock_irqsave(&sem->lock, flags); |
80 | if (likely(sem->count > 0)) | 79 | if (unlikely(!sem->count)) |
81 | sem->count--; | ||
82 | else | ||
83 | result = __down_interruptible(sem); | 80 | result = __down_interruptible(sem); |
81 | if (!result) | ||
82 | sem->count--; | ||
84 | spin_unlock_irqrestore(&sem->lock, flags); | 83 | spin_unlock_irqrestore(&sem->lock, flags); |
85 | 84 | ||
86 | return result; | 85 | return result; |
@@ -103,10 +102,10 @@ int down_killable(struct semaphore *sem) | |||
103 | int result = 0; | 102 | int result = 0; |
104 | 103 | ||
105 | spin_lock_irqsave(&sem->lock, flags); | 104 | spin_lock_irqsave(&sem->lock, flags); |
106 | if (likely(sem->count > 0)) | 105 | if (unlikely(!sem->count)) |
107 | sem->count--; | ||
108 | else | ||
109 | result = __down_killable(sem); | 106 | result = __down_killable(sem); |
107 | if (!result) | ||
108 | sem->count--; | ||
110 | spin_unlock_irqrestore(&sem->lock, flags); | 109 | spin_unlock_irqrestore(&sem->lock, flags); |
111 | 110 | ||
112 | return result; | 111 | return result; |
@@ -157,10 +156,10 @@ int down_timeout(struct semaphore *sem, long jiffies) | |||
157 | int result = 0; | 156 | int result = 0; |
158 | 157 | ||
159 | spin_lock_irqsave(&sem->lock, flags); | 158 | spin_lock_irqsave(&sem->lock, flags); |
160 | if (likely(sem->count > 0)) | 159 | if (unlikely(!sem->count)) |
161 | sem->count--; | ||
162 | else | ||
163 | result = __down_timeout(sem, jiffies); | 160 | result = __down_timeout(sem, jiffies); |
161 | if (!result) | ||
162 | sem->count--; | ||
164 | spin_unlock_irqrestore(&sem->lock, flags); | 163 | spin_unlock_irqrestore(&sem->lock, flags); |
165 | 164 | ||
166 | return result; | 165 | return result; |
@@ -179,9 +178,8 @@ void up(struct semaphore *sem) | |||
179 | unsigned long flags; | 178 | unsigned long flags; |
180 | 179 | ||
181 | spin_lock_irqsave(&sem->lock, flags); | 180 | spin_lock_irqsave(&sem->lock, flags); |
182 | if (likely(list_empty(&sem->wait_list))) | 181 | sem->count++; |
183 | sem->count++; | 182 | if (unlikely(!list_empty(&sem->wait_list))) |
184 | else | ||
185 | __up(sem); | 183 | __up(sem); |
186 | spin_unlock_irqrestore(&sem->lock, flags); | 184 | spin_unlock_irqrestore(&sem->lock, flags); |
187 | } | 185 | } |
@@ -192,7 +190,6 @@ EXPORT_SYMBOL(up); | |||
192 | struct semaphore_waiter { | 190 | struct semaphore_waiter { |
193 | struct list_head list; | 191 | struct list_head list; |
194 | struct task_struct *task; | 192 | struct task_struct *task; |
195 | int up; | ||
196 | }; | 193 | }; |
197 | 194 | ||
198 | /* | 195 | /* |
@@ -205,33 +202,34 @@ static inline int __sched __down_common(struct semaphore *sem, long state, | |||
205 | { | 202 | { |
206 | struct task_struct *task = current; | 203 | struct task_struct *task = current; |
207 | struct semaphore_waiter waiter; | 204 | struct semaphore_waiter waiter; |
205 | int ret = 0; | ||
208 | 206 | ||
209 | list_add_tail(&waiter.list, &sem->wait_list); | ||
210 | waiter.task = task; | 207 | waiter.task = task; |
211 | waiter.up = 0; | 208 | list_add_tail(&waiter.list, &sem->wait_list); |
212 | 209 | ||
213 | for (;;) { | 210 | for (;;) { |
214 | if (state == TASK_INTERRUPTIBLE && signal_pending(task)) | 211 | if (state == TASK_INTERRUPTIBLE && signal_pending(task)) { |
215 | goto interrupted; | 212 | ret = -EINTR; |
216 | if (state == TASK_KILLABLE && fatal_signal_pending(task)) | 213 | break; |
217 | goto interrupted; | 214 | } |
218 | if (timeout <= 0) | 215 | if (state == TASK_KILLABLE && fatal_signal_pending(task)) { |
219 | goto timed_out; | 216 | ret = -EINTR; |
217 | break; | ||
218 | } | ||
219 | if (timeout <= 0) { | ||
220 | ret = -ETIME; | ||
221 | break; | ||
222 | } | ||
220 | __set_task_state(task, state); | 223 | __set_task_state(task, state); |
221 | spin_unlock_irq(&sem->lock); | 224 | spin_unlock_irq(&sem->lock); |
222 | timeout = schedule_timeout(timeout); | 225 | timeout = schedule_timeout(timeout); |
223 | spin_lock_irq(&sem->lock); | 226 | spin_lock_irq(&sem->lock); |
224 | if (waiter.up) | 227 | if (sem->count > 0) |
225 | return 0; | 228 | break; |
226 | } | 229 | } |
227 | 230 | ||
228 | timed_out: | ||
229 | list_del(&waiter.list); | ||
230 | return -ETIME; | ||
231 | |||
232 | interrupted: | ||
233 | list_del(&waiter.list); | 231 | list_del(&waiter.list); |
234 | return -EINTR; | 232 | return ret; |
235 | } | 233 | } |
236 | 234 | ||
237 | static noinline void __sched __down(struct semaphore *sem) | 235 | static noinline void __sched __down(struct semaphore *sem) |
@@ -258,7 +256,5 @@ static noinline void __sched __up(struct semaphore *sem) | |||
258 | { | 256 | { |
259 | struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list, | 257 | struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list, |
260 | struct semaphore_waiter, list); | 258 | struct semaphore_waiter, list); |
261 | list_del(&waiter->list); | ||
262 | waiter->up = 1; | ||
263 | wake_up_process(waiter->task); | 259 | wake_up_process(waiter->task); |
264 | } | 260 | } |