aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorIngo Molnar <mingo@elte.hu>2008-05-08 05:53:48 -0400
committerIngo Molnar <mingo@elte.hu>2008-05-08 11:00:42 -0400
commitbf726eab3711cf192405d21688a4b21e07b6188a (patch)
tree5450be096ade3a66edee8858fd492ff55e20e4f8
parent3de2403e6659d71b36ec820dc9b942762ddfe6eb (diff)
semaphore: fix
Yanmin Zhang reported: | Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more th | regression under 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton, | and Itanium Montecito. Bisect located the patch below: | | 64ac24e738823161693bf791f87adc802cf529ff is first bad commit | commit 64ac24e738823161693bf791f87adc802cf529ff | Author: Matthew Wilcox <matthew@wil.cx> | Date: Fri Mar 7 21:55:58 2008 -0500 | | Generic semaphore implementation | | After I manually reverted the patch against 2.6.26-rc1 while fixing | lots of conflicts/errors, aim7 regression became less than 2%. i reproduced the AIM7 workload and can confirm Yanmin's findings that -.26-rc1 regresses over .25 - by over 67% here. Looking at the workload i found and fixed what i believe to be the real bug causing the AIM7 regression: it was inefficient wakeup / scheduling / locking behavior of the new generic semaphore code, causing suboptimal performance. The problem comes from the following code. The new semaphore code does this on down(): spin_lock_irqsave(&sem->lock, flags); if (likely(sem->count > 0)) sem->count--; else __down(sem); spin_unlock_irqrestore(&sem->lock, flags); and this on up(): spin_lock_irqsave(&sem->lock, flags); if (likely(list_empty(&sem->wait_list))) sem->count++; else __up(sem); spin_unlock_irqrestore(&sem->lock, flags); where __up() does: list_del(&waiter->list); waiter->up = 1; wake_up_process(waiter->task); and where __down() does this in essence: list_add_tail(&waiter.list, &sem->wait_list); waiter.task = task; waiter.up = 0; for (;;) { [...] spin_unlock_irq(&sem->lock); timeout = schedule_timeout(timeout); spin_lock_irq(&sem->lock); if (waiter.up) return 0; } the fastpath looks good and obvious, but note the following property of the contended path: if there's a task on the ->wait_list, the up() of the current owner will "pass over" ownership to that waiting task, in a wake-one manner, via the waiter->up flag and by removing the waiter from the wait list. That is all and fine in principle, but as implemented in kernel/semaphore.c it also creates a nasty, hidden source of contention! The contention comes from the following property of the new semaphore code: the new owner owns the semaphore exclusively, even if it is not running yet. So if the old owner, even if just a few instructions later, does a down() [lock_kernel()] again, it will be blocked and will have to wait on the new owner to eventually be scheduled (possibly on another CPU)! Or if another task gets to lock_kernel() sooner than the "new owner" scheduled, it will be blocked unnecessarily and for a very long time when there are 2000 tasks running. I.e. the implementation of the new semaphores code does wake-one and lock ownership in a very restrictive way - it does not allow opportunistic re-locking of the lock at all and keeps the scheduler from picking task order intelligently. This kind of scheduling, with 2000 AIM7 processes running, creates awful cross-scheduling between those 2000 tasks, causes reduced parallelism, a throttled runqueue length and a lot of idle time. With increasing number of CPUs it causes an exponentially worse behavior in AIM7, as the chance for a newly woken new-owner task to actually run anytime soon is less and less likely. Note that it takes just a tiny bit of contention for the 'new-semaphore catastrophy' to happen: the wakeup latencies get added to whatever small contention there is, and quickly snowball out of control! I believe Yanmin's findings and numbers support this analysis too. The best fix for this problem is to use the same scheduling logic that the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and wanted because we do not want to over-schedule), but also allow opportunistic locking of the lock even if a wakee is already "in flight". The patch below implements this new logic. With this patch applied the AIM7 regression is largely fixed on my quad testbox: # v2.6.25 vanilla: .................. Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 56096.4 91 207.5 789.7 0.4675 2000 55894.4 94 208.2 792.7 0.4658 # v2.6.26-rc1-166-gc0a1811 vanilla: ................................... Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 33230.6 83 350.3 784.5 0.2769 2000 31778.1 86 366.3 783.6 0.2648 # v2.6.26-rc1-166-gc0a1811 + semaphore-speedup: ............................................... Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 55707.1 92 209.0 795.6 0.4642 2000 55704.4 96 209.0 796.0 0.4642 i.e. a 67% speedup. We are now back to within 1% of the v2.6.25 performance levels and have zero idle time during the test, as expected. Btw., interactivity also improved dramatically with the fix - for example console-switching became almost instantaneous during this workload (which after all is running 2000 tasks at once!), without the patch it was stuck for a minute at times. There's another nice side-effect of this speedup patch, the new generic semaphore code got even smaller: text data bss dec hex filename 1241 0 0 1241 4d9 semaphore.o.before 1207 0 0 1207 4b7 semaphore.o.after (because the waiter.up complication got removed.) Longer-term we should look into using the mutex code for the generic semaphore code as well - but i's not easy due to legacies and it's outside of the scope of v2.6.26 and outside the scope of this patch as well. Bisected-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
-rw-r--r--kernel/semaphore.c64
1 files changed, 30 insertions, 34 deletions
diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5c2942e768cd..5e41217239e8 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)
54 unsigned long flags; 54 unsigned long flags;
55 55
56 spin_lock_irqsave(&sem->lock, flags); 56 spin_lock_irqsave(&sem->lock, flags);
57 if (likely(sem->count > 0)) 57 if (unlikely(!sem->count))
58 sem->count--;
59 else
60 __down(sem); 58 __down(sem);
59 sem->count--;
61 spin_unlock_irqrestore(&sem->lock, flags); 60 spin_unlock_irqrestore(&sem->lock, flags);
62} 61}
63EXPORT_SYMBOL(down); 62EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore *sem)
77 int result = 0; 76 int result = 0;
78 77
79 spin_lock_irqsave(&sem->lock, flags); 78 spin_lock_irqsave(&sem->lock, flags);
80 if (likely(sem->count > 0)) 79 if (unlikely(!sem->count))
81 sem->count--;
82 else
83 result = __down_interruptible(sem); 80 result = __down_interruptible(sem);
81 if (!result)
82 sem->count--;
84 spin_unlock_irqrestore(&sem->lock, flags); 83 spin_unlock_irqrestore(&sem->lock, flags);
85 84
86 return result; 85 return result;
@@ -103,10 +102,10 @@ int down_killable(struct semaphore *sem)
103 int result = 0; 102 int result = 0;
104 103
105 spin_lock_irqsave(&sem->lock, flags); 104 spin_lock_irqsave(&sem->lock, flags);
106 if (likely(sem->count > 0)) 105 if (unlikely(!sem->count))
107 sem->count--;
108 else
109 result = __down_killable(sem); 106 result = __down_killable(sem);
107 if (!result)
108 sem->count--;
110 spin_unlock_irqrestore(&sem->lock, flags); 109 spin_unlock_irqrestore(&sem->lock, flags);
111 110
112 return result; 111 return result;
@@ -157,10 +156,10 @@ int down_timeout(struct semaphore *sem, long jiffies)
157 int result = 0; 156 int result = 0;
158 157
159 spin_lock_irqsave(&sem->lock, flags); 158 spin_lock_irqsave(&sem->lock, flags);
160 if (likely(sem->count > 0)) 159 if (unlikely(!sem->count))
161 sem->count--;
162 else
163 result = __down_timeout(sem, jiffies); 160 result = __down_timeout(sem, jiffies);
161 if (!result)
162 sem->count--;
164 spin_unlock_irqrestore(&sem->lock, flags); 163 spin_unlock_irqrestore(&sem->lock, flags);
165 164
166 return result; 165 return result;
@@ -179,9 +178,8 @@ void up(struct semaphore *sem)
179 unsigned long flags; 178 unsigned long flags;
180 179
181 spin_lock_irqsave(&sem->lock, flags); 180 spin_lock_irqsave(&sem->lock, flags);
182 if (likely(list_empty(&sem->wait_list))) 181 sem->count++;
183 sem->count++; 182 if (unlikely(!list_empty(&sem->wait_list)))
184 else
185 __up(sem); 183 __up(sem);
186 spin_unlock_irqrestore(&sem->lock, flags); 184 spin_unlock_irqrestore(&sem->lock, flags);
187} 185}
@@ -192,7 +190,6 @@ EXPORT_SYMBOL(up);
192struct semaphore_waiter { 190struct semaphore_waiter {
193 struct list_head list; 191 struct list_head list;
194 struct task_struct *task; 192 struct task_struct *task;
195 int up;
196}; 193};
197 194
198/* 195/*
@@ -205,33 +202,34 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
205{ 202{
206 struct task_struct *task = current; 203 struct task_struct *task = current;
207 struct semaphore_waiter waiter; 204 struct semaphore_waiter waiter;
205 int ret = 0;
208 206
209 list_add_tail(&waiter.list, &sem->wait_list);
210 waiter.task = task; 207 waiter.task = task;
211 waiter.up = 0; 208 list_add_tail(&waiter.list, &sem->wait_list);
212 209
213 for (;;) { 210 for (;;) {
214 if (state == TASK_INTERRUPTIBLE && signal_pending(task)) 211 if (state == TASK_INTERRUPTIBLE && signal_pending(task)) {
215 goto interrupted; 212 ret = -EINTR;
216 if (state == TASK_KILLABLE && fatal_signal_pending(task)) 213 break;
217 goto interrupted; 214 }
218 if (timeout <= 0) 215 if (state == TASK_KILLABLE && fatal_signal_pending(task)) {
219 goto timed_out; 216 ret = -EINTR;
217 break;
218 }
219 if (timeout <= 0) {
220 ret = -ETIME;
221 break;
222 }
220 __set_task_state(task, state); 223 __set_task_state(task, state);
221 spin_unlock_irq(&sem->lock); 224 spin_unlock_irq(&sem->lock);
222 timeout = schedule_timeout(timeout); 225 timeout = schedule_timeout(timeout);
223 spin_lock_irq(&sem->lock); 226 spin_lock_irq(&sem->lock);
224 if (waiter.up) 227 if (sem->count > 0)
225 return 0; 228 break;
226 } 229 }
227 230
228 timed_out:
229 list_del(&waiter.list);
230 return -ETIME;
231
232 interrupted:
233 list_del(&waiter.list); 231 list_del(&waiter.list);
234 return -EINTR; 232 return ret;
235} 233}
236 234
237static noinline void __sched __down(struct semaphore *sem) 235static noinline void __sched __down(struct semaphore *sem)
@@ -258,7 +256,5 @@ static noinline void __sched __up(struct semaphore *sem)
258{ 256{
259 struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list, 257 struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
260 struct semaphore_waiter, list); 258 struct semaphore_waiter, list);
261 list_del(&waiter->list);
262 waiter->up = 1;
263 wake_up_process(waiter->task); 259 wake_up_process(waiter->task);
264} 260}