author     Ingo Molnar <mingo@elte.hu>               2006-06-27 05:54:47 -0400
committer  Linus Torvalds <torvalds@g5.osdl.org>     2006-06-27 20:32:46 -0400
commit     e2970f2fb6950183a34e8545faa093eb49d186e1 (patch)
tree       a4035274368d846488a3b0152925502c06b064b0
parent     66e5393a78b3fcca63e7748e38221dcca61c4aab (diff)
[PATCH] pi-futex: futex code cleanups
We are pleased to announce "lightweight userspace priority inheritance" (PI) support for futexes. The following patchset and glibc patch implement it, on top of the robust-futexes patchset which is included in 2.6.16-mm1.

We are calling it lightweight for 3 reasons:

 - in the user-space fastpath a PI-enabled futex involves no kernel work
   (or any other PI complexity) at all. No registration, no extra kernel
   calls - just pure fast atomic ops in userspace.

 - in the slowpath (in the lock-contention case), the system call and
   scheduling pattern is in fact better than that of normal futexes, due
   to the 'integrated' nature of FUTEX_LOCK_PI. [more about that further
   down]

 - the in-kernel PI implementation is streamlined around the mutex
   abstraction, with strict rules that keep the implementation relatively
   simple: only a single owner may own a lock (i.e. no read-write lock
   support), only the owner may unlock a lock, no recursive locking, etc.

Priority Inheritance - why, oh why???
-------------------------------------

Many of you have heard the horror stories about the evil PI code circling Linux for years, which makes no real sense at all, is only used by buggy applications, and has horrible overhead. Some of you have dreaded this very moment, when someone actually submits working PI code ;-)

So why would we like to see PI support for futexes?

We'd like to see it done purely for technological reasons. We don't think it's a buggy concept; we think it's useful functionality to offer to applications, functionality which cannot be achieved in other ways. We also think it's the right thing to do, we think we've got the right arguments and the right numbers to prove that, and we believe we can address all the counter-arguments as well. For these reasons (and the reasons outlined below) we are submitting this patch-set for upstream kernel inclusion.

What are the benefits of PI?

The short reply:
----------------

User-space PI helps achieve/improve determinism for user-space applications. In the best case, it can help achieve determinism and well-bound latencies. Even in the worst case, PI will improve the statistical distribution of locking-related application delays.

The longer reply:
-----------------

Firstly, sharing locks between multiple tasks is a common programming technique that often cannot be replaced with lockless algorithms. As we can see it in the kernel [which is a quite complex program in itself], lockless structures are rather the exception than the norm - the current ratio of lockless vs. locky code for shared data structures is somewhere between 1:10 and 1:100. Lockless is hard, and the complexity of lockless algorithms often endangers the ability to do robust reviews of said code. I.e. critical RT apps often choose lock structures to protect critical data structures, instead of lockless algorithms. Furthermore, there are cases (like shared hardware, or other resource limits) where lockless access is mathematically impossible.

Media players (such as Jack) are an example of reasonable application design with multiple tasks (with multiple priority levels) sharing short-held locks: for example, a highprio audio playback thread is combined with medium-prio construct-audio-data threads and low-prio display-colory-stuff threads. Add video and decoding to the mix and we've got even more priority levels.
So once we accept that synchronization objects (locks) are an unavoidable fact of life, and once we accept that multi-task userspace apps have a very fair expectation of being able to use locks, we've got to think about how to offer the option of a deterministic locking implementation to user-space.

Most of the technical counter-arguments against doing priority inheritance only apply to kernel-space locks. But user-space locks are different: there we cannot disable interrupts or make the task non-preemptible in a critical section, so the 'use spinlocks' argument does not apply (user-space spinlocks have the same priority inversion problems as other user-space locking constructs). Fact is, pretty much the only technique that currently enables good determinism for userspace locks (such as futex-based pthread mutexes) is priority inheritance:

Currently (without PI), if a high-prio and a low-prio task share a lock [this is a quite common scenario for most non-trivial RT applications], even if all critical sections are coded carefully to be deterministic (i.e. all critical sections are short in duration and only execute a limited number of instructions), the kernel cannot guarantee any deterministic execution of the high-prio task: any medium-priority task could preempt the low-prio task while it holds the shared lock and executes the critical section, and could delay it indefinitely.

Implementation:
---------------

As mentioned before, the userspace fastpath of PI-enabled pthread mutexes involves no kernel work at all - they behave quite similarly to normal futex-based locks: a 0 value means unlocked, and a value==TID means locked. (This is the same method as used by list-based robust futexes.) Userspace uses atomic ops to lock/unlock these mutexes without entering the kernel.

To handle the slowpath, we have added two new futex ops:

  FUTEX_LOCK_PI
  FUTEX_UNLOCK_PI

If the lock-acquire fastpath fails [i.e. an atomic transition from 0 to TID fails], then FUTEX_LOCK_PI is called. The kernel does all the remaining work: if there is no futex-queue attached to the futex address yet then the code looks up the task that owns the futex [it has put its own TID into the futex value], and attaches a 'PI state' structure to the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware, kernel-based synchronization object. The 'other' task is made the owner of the rt-mutex, and the FUTEX_WAITERS bit is atomically set in the futex value. Then this task tries to lock the rt-mutex, on which it blocks. Once it returns, it has the mutex acquired, and it sets the futex value to its own TID and returns. Userspace has no other work to perform - it now owns the lock, and the futex value contains FUTEX_WAITERS|TID.

If the unlock-side fastpath succeeds [i.e. userspace manages to do a TID -> 0 atomic transition of the futex value], then no kernel work is triggered.

If the unlock fastpath fails (because the FUTEX_WAITERS bit is set), then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on behalf of userspace - and it also unlocks the attached pi_state->rt_mutex and thus wakes up any potential waiters.

Note that under this approach, contrary to other PI-futex approaches, there is no prior 'registration' of a PI-futex. [Which is not quite possible anyway, due to existing ABI properties of pthread mutexes.]

Also, under this scheme, 'robustness' and 'PI' are two orthogonal properties of futexes, and all four combinations are possible: futex, robust-futex, PI-futex, robust+PI-futex.
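To make the fastpath/slowpath split above concrete, the following is a minimal userspace sketch of the locking protocol. It is illustration only, not glibc's implementation: the helper names (pi_mutex_lock/pi_mutex_unlock, futex wrapper) are made up, error handling, timeouts and the robust-list interaction are omitted, and it assumes kernel headers that already define the FUTEX_LOCK_PI/FUTEX_UNLOCK_PI ops introduced later in this patch series.

/*
 * Illustrative sketch of the PI-futex lock/unlock protocol described
 * above. Not the glibc implementation; no error handling or timeouts.
 */
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(uint32_t *uaddr, int op, uint32_t val)
{
        /* There is no glibc wrapper for futex(); invoke it directly. */
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* 0 == unlocked, TID == locked; FUTEX_WAITERS is set once contended. */
static void pi_mutex_lock(_Atomic uint32_t *futex_word)
{
        uint32_t expected = 0;
        uint32_t tid = (uint32_t)syscall(SYS_gettid);

        /* Fastpath: 0 -> TID transition, no kernel work at all. */
        if (atomic_compare_exchange_strong(futex_word, &expected, tid))
                return;

        /*
         * Slowpath: the kernel attaches the pi_state/rt-mutex, sets
         * FUTEX_WAITERS, boosts the owner and blocks us until we own
         * the lock.
         */
        futex((uint32_t *)futex_word, FUTEX_LOCK_PI, 0);
}

static void pi_mutex_unlock(_Atomic uint32_t *futex_word)
{
        uint32_t tid = (uint32_t)syscall(SYS_gettid);
        uint32_t expected = tid;

        /* Fastpath: TID -> 0 transition succeeds only if nobody waits. */
        if (atomic_compare_exchange_strong(futex_word, &expected, 0))
                return;

        /*
         * FUTEX_WAITERS is set: let the kernel unlock the rt-mutex and
         * wake the highest-priority waiter.
         */
        futex((uint32_t *)futex_word, FUTEX_UNLOCK_PI, 0);
}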
glibc support:
--------------

Ulrich Drepper and Jakub Jelinek have written glibc support for PI-futexes (and robust futexes), enabling robust and PI (PTHREAD_PRIO_INHERIT) POSIX mutexes. (PTHREAD_PRIO_PROTECT support will be added later on too; no additional kernel changes are needed for that.) [NOTE: the glibc patch is obviously unofficial and unsupported without matching upstream kernel functionality.]

The patch-queue and the glibc patch can also be downloaded from:

  http://redhat.com/~mingo/PI-futex-patches/

Many thanks go to the people who helped us create this kernel feature: Steven Rostedt, Esben Nielsen, Benedikt Spranger, Daniel Walker, John Cooper, Arjan van de Ven, Oleg Nesterov and others. Credit for related prior projects goes to Dirk Grambow, Inaky Perez-Gonzalez, Bill Huey and many others.

Clean up the futex code, before adding more features to it:

 - use u32 as the futex field type - that's the ABI
 - use __user and pointers to u32 instead of unsigned long
 - code style / comment style cleanups
 - rename hash-bucket name from 'bh' to 'hb'

I checked the pre- and post-futex.o object files to make sure this patch has no code effects.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
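For completeness, this is roughly how an application asks for a PI mutex through the standard POSIX API once the glibc support described above is in place; a minimal sketch assuming a glibc built with PI-futex support, and the helper name is made up:

/* Sketch: create a priority-inheriting POSIX mutex. */
#include <pthread.h>

int make_pi_mutex(pthread_mutex_t *m)
{
        pthread_mutexattr_t attr;
        int err;

        err = pthread_mutexattr_init(&attr);
        if (err)
                return err;

        /*
         * Request priority inheritance; the contended cases of this
         * mutex are then handled via FUTEX_LOCK_PI/FUTEX_UNLOCK_PI.
         */
        err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        if (!err)
                err = pthread_mutex_init(m, &attr);

        pthread_mutexattr_destroy(&attr);
        return err;
}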
-rw-r--r--   include/linux/futex.h        5
-rw-r--r--   include/linux/syscalls.h     4
-rw-r--r--   kernel/futex.c             245
-rw-r--r--   kernel/futex_compat.c        3
4 files changed, 135 insertions, 122 deletions
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 966a5b3da439..f05a3f469322 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -90,9 +90,8 @@ struct robust_list_head {
  */
 #define ROBUST_LIST_LIMIT      2048
 
-long do_futex(unsigned long uaddr, int op, int val,
-               unsigned long timeout, unsigned long uaddr2, int val2,
-               int val3);
+long do_futex(u32 __user *uaddr, int op, u32 val, unsigned long timeout,
+              u32 __user *uaddr2, u32 val2, u32 val3);
 
 extern int handle_futex_death(u32 __user *uaddr, struct task_struct *curr);
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 33785b79d548..008f04c56737 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -174,9 +174,9 @@ asmlinkage long sys_waitid(int which, pid_t pid,
                        int options, struct rusage __user *ru);
 asmlinkage long sys_waitpid(pid_t pid, int __user *stat_addr, int options);
 asmlinkage long sys_set_tid_address(int __user *tidptr);
-asmlinkage long sys_futex(u32 __user *uaddr, int op, int val,
+asmlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
                        struct timespec __user *utime, u32 __user *uaddr2,
-                       int val3);
+                       u32 val3);
 
 asmlinkage long sys_init_module(void __user *umod, unsigned long len,
                                const char __user *uargs);
diff --git a/kernel/futex.c b/kernel/futex.c
index e1a380c77a5a..50356fb5d726 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -63,7 +63,7 @@ union futex_key {
                int offset;
        } shared;
        struct {
-               unsigned long uaddr;
+               unsigned long address;
                struct mm_struct *mm;
                int offset;
        } private;
@@ -87,13 +87,13 @@ struct futex_q {
        struct list_head list;
        wait_queue_head_t waiters;
 
-       /* Which hash list lock to use. */
+       /* Which hash list lock to use: */
        spinlock_t *lock_ptr;
 
-       /* Key which the futex is hashed on. */
+       /* Key which the futex is hashed on: */
        union futex_key key;
 
-       /* For fd, sigio sent using these. */
+       /* For fd, sigio sent using these: */
        int fd;
        struct file *filp;
 };
@@ -144,8 +144,9 @@ static inline int match_futex(union futex_key *key1, union futex_key *key2)
  *
  * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
  */
-static int get_futex_key(unsigned long uaddr, union futex_key *key)
+static int get_futex_key(u32 __user *uaddr, union futex_key *key)
 {
+       unsigned long address = (unsigned long)uaddr;
        struct mm_struct *mm = current->mm;
        struct vm_area_struct *vma;
        struct page *page;
@@ -154,16 +155,16 @@ static int get_futex_key(unsigned long uaddr, union futex_key *key)
        /*
         * The futex address must be "naturally" aligned.
         */
-       key->both.offset = uaddr % PAGE_SIZE;
+       key->both.offset = address % PAGE_SIZE;
        if (unlikely((key->both.offset % sizeof(u32)) != 0))
                return -EINVAL;
-       uaddr -= key->both.offset;
+       address -= key->both.offset;
 
        /*
         * The futex is hashed differently depending on whether
         * it's in a shared or private mapping.  So check vma first.
         */
-       vma = find_extend_vma(mm, uaddr);
+       vma = find_extend_vma(mm, address);
        if (unlikely(!vma))
                return -EFAULT;
 
@@ -184,7 +185,7 @@ static int get_futex_key(unsigned long uaddr, union futex_key *key)
         */
        if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
                key->private.mm = mm;
-               key->private.uaddr = uaddr;
+               key->private.address = address;
                return 0;
        }
 
@@ -194,7 +195,7 @@ static int get_futex_key(unsigned long uaddr, union futex_key *key)
        key->shared.inode = vma->vm_file->f_dentry->d_inode;
        key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
        if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
-               key->shared.pgoff = (((uaddr - vma->vm_start) >> PAGE_SHIFT)
+               key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
                                     + vma->vm_pgoff);
                return 0;
        }
@@ -205,7 +206,7 @@ static int get_futex_key(unsigned long uaddr, union futex_key *key)
         * from swap.  But that's a lot of code to duplicate here
         * for a rare case, so we simply fetch the page.
         */
-       err = get_user_pages(current, mm, uaddr, 1, 0, 0, &page, NULL);
+       err = get_user_pages(current, mm, address, 1, 0, 0, &page, NULL);
        if (err >= 0) {
                key->shared.pgoff =
                        page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -246,12 +247,12 @@ static void drop_key_refs(union futex_key *key)
        }
 }
 
-static inline int get_futex_value_locked(int *dest, int __user *from)
+static inline int get_futex_value_locked(u32 *dest, u32 __user *from)
 {
        int ret;
 
        inc_preempt_count();
-       ret = __copy_from_user_inatomic(dest, from, sizeof(int));
+       ret = __copy_from_user_inatomic(dest, from, sizeof(u32));
        dec_preempt_count();
 
        return ret ? -EFAULT : 0;
@@ -288,12 +289,12 @@ static void wake_futex(struct futex_q *q)
  * Wake up all waiters hashed on the physical page that is mapped
  * to this virtual address:
  */
-static int futex_wake(unsigned long uaddr, int nr_wake)
+static int futex_wake(u32 __user *uaddr, int nr_wake)
 {
-       union futex_key key;
-       struct futex_hash_bucket *bh;
-       struct list_head *head;
+       struct futex_hash_bucket *hb;
        struct futex_q *this, *next;
+       struct list_head *head;
+       union futex_key key;
        int ret;
 
        down_read(&current->mm->mmap_sem);
@@ -302,9 +303,9 @@ static int futex_wake(unsigned long uaddr, int nr_wake)
        if (unlikely(ret != 0))
                goto out;
 
-       bh = hash_futex(&key);
-       spin_lock(&bh->lock);
-       head = &bh->chain;
+       hb = hash_futex(&key);
+       spin_lock(&hb->lock);
+       head = &hb->chain;
 
        list_for_each_entry_safe(this, next, head, list) {
                if (match_futex (&this->key, &key)) {
@@ -314,7 +315,7 @@ static int futex_wake(unsigned long uaddr, int nr_wake)
                }
        }
 
-       spin_unlock(&bh->lock);
+       spin_unlock(&hb->lock);
 out:
        up_read(&current->mm->mmap_sem);
        return ret;
@@ -324,10 +325,12 @@ out:
  * Wake up all waiters hashed on the physical page that is mapped
  * to this virtual address:
  */
-static int futex_wake_op(unsigned long uaddr1, unsigned long uaddr2, int nr_wake, int nr_wake2, int op)
+static int
+futex_wake_op(u32 __user *uaddr1, u32 __user *uaddr2,
+             int nr_wake, int nr_wake2, int op)
 {
        union futex_key key1, key2;
-       struct futex_hash_bucket *bh1, *bh2;
+       struct futex_hash_bucket *hb1, *hb2;
        struct list_head *head;
        struct futex_q *this, *next;
        int ret, op_ret, attempt = 0;
@@ -342,27 +345,29 @@ retryfull:
        if (unlikely(ret != 0))
                goto out;
 
-       bh1 = hash_futex(&key1);
-       bh2 = hash_futex(&key2);
+       hb1 = hash_futex(&key1);
+       hb2 = hash_futex(&key2);
 
 retry:
-       if (bh1 < bh2)
-               spin_lock(&bh1->lock);
-       spin_lock(&bh2->lock);
-       if (bh1 > bh2)
-               spin_lock(&bh1->lock);
+       if (hb1 < hb2)
+               spin_lock(&hb1->lock);
+       spin_lock(&hb2->lock);
+       if (hb1 > hb2)
+               spin_lock(&hb1->lock);
 
-       op_ret = futex_atomic_op_inuser(op, (int __user *)uaddr2);
+       op_ret = futex_atomic_op_inuser(op, uaddr2);
        if (unlikely(op_ret < 0)) {
-               int dummy;
+               u32 dummy;
 
-               spin_unlock(&bh1->lock);
-               if (bh1 != bh2)
-                       spin_unlock(&bh2->lock);
+               spin_unlock(&hb1->lock);
+               if (hb1 != hb2)
+                       spin_unlock(&hb2->lock);
 
 #ifndef CONFIG_MMU
-               /* we don't get EFAULT from MMU faults if we don't have an MMU,
-                * but we might get them from range checking */
+               /*
+                * we don't get EFAULT from MMU faults if we don't have an MMU,
+                * but we might get them from range checking
+                */
                ret = op_ret;
                goto out;
 #endif
@@ -372,23 +377,26 @@ retry:
                        goto out;
                }
 
-               /* futex_atomic_op_inuser needs to both read and write
+               /*
+                * futex_atomic_op_inuser needs to both read and write
                 * *(int __user *)uaddr2, but we can't modify it
                 * non-atomically.  Therefore, if get_user below is not
                 * enough, we need to handle the fault ourselves, while
-                * still holding the mmap_sem. */
+                * still holding the mmap_sem.
+                */
                if (attempt++) {
                        struct vm_area_struct * vma;
                        struct mm_struct *mm = current->mm;
+                       unsigned long address = (unsigned long)uaddr2;
 
                        ret = -EFAULT;
                        if (attempt >= 2 ||
-                           !(vma = find_vma(mm, uaddr2)) ||
-                           vma->vm_start > uaddr2 ||
+                           !(vma = find_vma(mm, address)) ||
+                           vma->vm_start > address ||
                            !(vma->vm_flags & VM_WRITE))
                                goto out;
 
-                       switch (handle_mm_fault(mm, vma, uaddr2, 1)) {
+                       switch (handle_mm_fault(mm, vma, address, 1)) {
                        case VM_FAULT_MINOR:
                                current->min_flt++;
                                break;
@@ -401,18 +409,20 @@ retry:
                        goto retry;
                }
 
-               /* If we would have faulted, release mmap_sem,
-                * fault it in and start all over again. */
+               /*
+                * If we would have faulted, release mmap_sem,
+                * fault it in and start all over again.
+                */
                up_read(&current->mm->mmap_sem);
 
-               ret = get_user(dummy, (int __user *)uaddr2);
+               ret = get_user(dummy, uaddr2);
                if (ret)
                        return ret;
 
                goto retryfull;
        }
414 424
 
-       head = &bh1->chain;
+       head = &hb1->chain;
 
        list_for_each_entry_safe(this, next, head, list) {
                if (match_futex (&this->key, &key1)) {
@@ -423,7 +433,7 @@ retry:
                }
        }
 
        if (op_ret > 0) {
-               head = &bh2->chain;
+               head = &hb2->chain;
 
                op_ret = 0;
                list_for_each_entry_safe(this, next, head, list) {
@@ -436,9 +446,9 @@ retry:
                ret += op_ret;
        }
 
-       spin_unlock(&bh1->lock);
-       if (bh1 != bh2)
-               spin_unlock(&bh2->lock);
+       spin_unlock(&hb1->lock);
+       if (hb1 != hb2)
+               spin_unlock(&hb2->lock);
 out:
        up_read(&current->mm->mmap_sem);
        return ret;
@@ -448,11 +458,11 @@ out:
  * Requeue all waiters hashed on one physical page to another
  * physical page.
  */
-static int futex_requeue(unsigned long uaddr1, unsigned long uaddr2,
-                        int nr_wake, int nr_requeue, int *valp)
+static int futex_requeue(u32 __user *uaddr1, u32 __user *uaddr2,
+                        int nr_wake, int nr_requeue, u32 *cmpval)
 {
        union futex_key key1, key2;
-       struct futex_hash_bucket *bh1, *bh2;
+       struct futex_hash_bucket *hb1, *hb2;
        struct list_head *head1;
        struct futex_q *this, *next;
        int ret, drop_count = 0;
@@ -467,68 +477,69 @@ static int futex_requeue(unsigned long uaddr1, unsigned long uaddr2,
        if (unlikely(ret != 0))
                goto out;
 
-       bh1 = hash_futex(&key1);
-       bh2 = hash_futex(&key2);
+       hb1 = hash_futex(&key1);
+       hb2 = hash_futex(&key2);
 
-       if (bh1 < bh2)
-               spin_lock(&bh1->lock);
-       spin_lock(&bh2->lock);
-       if (bh1 > bh2)
-               spin_lock(&bh1->lock);
+       if (hb1 < hb2)
+               spin_lock(&hb1->lock);
+       spin_lock(&hb2->lock);
+       if (hb1 > hb2)
+               spin_lock(&hb1->lock);
 
-       if (likely(valp != NULL)) {
-               int curval;
+       if (likely(cmpval != NULL)) {
+               u32 curval;
 
-               ret = get_futex_value_locked(&curval, (int __user *)uaddr1);
+               ret = get_futex_value_locked(&curval, uaddr1);
 
                if (unlikely(ret)) {
-                       spin_unlock(&bh1->lock);
-                       if (bh1 != bh2)
-                               spin_unlock(&bh2->lock);
+                       spin_unlock(&hb1->lock);
+                       if (hb1 != hb2)
+                               spin_unlock(&hb2->lock);
 
-                       /* If we would have faulted, release mmap_sem, fault
+                       /*
+                        * If we would have faulted, release mmap_sem, fault
                         * it in and start all over again.
                         */
                        up_read(&current->mm->mmap_sem);
 
-                       ret = get_user(curval, (int __user *)uaddr1);
+                       ret = get_user(curval, uaddr1);
 
                        if (!ret)
                                goto retry;
 
                        return ret;
                }
-               if (curval != *valp) {
+               if (curval != *cmpval) {
                        ret = -EAGAIN;
                        goto out_unlock;
                }
        }
 
-       head1 = &bh1->chain;
+       head1 = &hb1->chain;
        list_for_each_entry_safe(this, next, head1, list) {
                if (!match_futex (&this->key, &key1))
                        continue;
                if (++ret <= nr_wake) {
                        wake_futex(this);
                } else {
-                       list_move_tail(&this->list, &bh2->chain);
-                       this->lock_ptr = &bh2->lock;
+                       list_move_tail(&this->list, &hb2->chain);
+                       this->lock_ptr = &hb2->lock;
                        this->key = key2;
                        get_key_refs(&key2);
                        drop_count++;
 
                        if (ret - nr_wake >= nr_requeue)
                                break;
-                       /* Make sure to stop if key1 == key2 */
-                       if (head1 == &bh2->chain && head1 != &next->list)
+                       /* Make sure to stop if key1 == key2: */
+                       if (head1 == &hb2->chain && head1 != &next->list)
                                head1 = &this->list;
                }
        }
 
 out_unlock:
-       spin_unlock(&bh1->lock);
-       if (bh1 != bh2)
-               spin_unlock(&bh2->lock);
+       spin_unlock(&hb1->lock);
+       if (hb1 != hb2)
+               spin_unlock(&hb2->lock);
 
        /* drop_key_refs() must be called outside the spinlocks. */
        while (--drop_count >= 0)
@@ -543,7 +554,7 @@ out:
 static inline struct futex_hash_bucket *
 queue_lock(struct futex_q *q, int fd, struct file *filp)
 {
-       struct futex_hash_bucket *bh;
+       struct futex_hash_bucket *hb;
 
        q->fd = fd;
        q->filp = filp;
@@ -551,23 +562,23 @@ queue_lock(struct futex_q *q, int fd, struct file *filp)
        init_waitqueue_head(&q->waiters);
 
        get_key_refs(&q->key);
-       bh = hash_futex(&q->key);
-       q->lock_ptr = &bh->lock;
+       hb = hash_futex(&q->key);
+       q->lock_ptr = &hb->lock;
 
-       spin_lock(&bh->lock);
-       return bh;
+       spin_lock(&hb->lock);
+       return hb;
 }
 
-static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *bh)
+static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
 {
-       list_add_tail(&q->list, &bh->chain);
-       spin_unlock(&bh->lock);
+       list_add_tail(&q->list, &hb->chain);
+       spin_unlock(&hb->lock);
 }
 
 static inline void
-queue_unlock(struct futex_q *q, struct futex_hash_bucket *bh)
+queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
 {
-       spin_unlock(&bh->lock);
+       spin_unlock(&hb->lock);
        drop_key_refs(&q->key);
 }
 
@@ -579,16 +590,17 @@ queue_unlock(struct futex_q *q, struct futex_hash_bucket *bh)
 /* The key must be already stored in q->key. */
 static void queue_me(struct futex_q *q, int fd, struct file *filp)
 {
-       struct futex_hash_bucket *bh;
-       bh = queue_lock(q, fd, filp);
-       __queue_me(q, bh);
+       struct futex_hash_bucket *hb;
+
+       hb = queue_lock(q, fd, filp);
+       __queue_me(q, hb);
 }
 
 /* Return 1 if we were still queued (ie. 0 means we were woken) */
 static int unqueue_me(struct futex_q *q)
 {
-       int ret = 0;
        spinlock_t *lock_ptr;
+       int ret = 0;
 
        /* In the common case we don't take the spinlock, which is nice. */
  retry:
@@ -622,12 +634,13 @@ static int unqueue_me(struct futex_q *q)
        return ret;
 }
 
-static int futex_wait(unsigned long uaddr, int val, unsigned long time)
+static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
 {
        DECLARE_WAITQUEUE(wait, current);
-       int ret, curval;
+       struct futex_hash_bucket *hb;
        struct futex_q q;
-       struct futex_hash_bucket *bh;
+       u32 uval;
+       int ret;
 
  retry:
        down_read(&current->mm->mmap_sem);
@@ -636,7 +649,7 @@ static int futex_wait(unsigned long uaddr, int val, unsigned long time)
        if (unlikely(ret != 0))
                goto out_release_sem;
 
-       bh = queue_lock(&q, -1, NULL);
+       hb = queue_lock(&q, -1, NULL);
 
        /*
         * Access the page AFTER the futex is queued.
@@ -658,31 +671,31 @@ static int futex_wait(unsigned long uaddr, int val, unsigned long time)
         * We hold the mmap semaphore, so the mapping cannot have changed
         * since we looked it up in get_futex_key.
         */
-
-       ret = get_futex_value_locked(&curval, (int __user *)uaddr);
+       ret = get_futex_value_locked(&uval, uaddr);
 
        if (unlikely(ret)) {
-               queue_unlock(&q, bh);
+               queue_unlock(&q, hb);
 
-               /* If we would have faulted, release mmap_sem, fault it in and
+               /*
+                * If we would have faulted, release mmap_sem, fault it in and
                 * start all over again.
                 */
                up_read(&current->mm->mmap_sem);
 
-               ret = get_user(curval, (int __user *)uaddr);
+               ret = get_user(uval, uaddr);
 
                if (!ret)
                        goto retry;
                return ret;
        }
-       if (curval != val) {
+       if (uval != val) {
                ret = -EWOULDBLOCK;
-               queue_unlock(&q, bh);
+               queue_unlock(&q, hb);
                goto out_release_sem;
        }
 
        /* Only actually queue if *uaddr contained val.  */
-       __queue_me(&q, bh);
+       __queue_me(&q, hb);
 
        /*
         * Now the futex is queued and we have checked the data, we
@@ -720,8 +733,10 @@ static int futex_wait(unsigned long uaddr, int val, unsigned long time)
                return 0;
        if (time == 0)
                return -ETIMEDOUT;
-       /* We expect signal_pending(current), but another thread may
-        * have handled it for us already. */
+       /*
+        * We expect signal_pending(current), but another thread may
+        * have handled it for us already.
+        */
        return -EINTR;
 
  out_release_sem:
@@ -735,6 +750,7 @@ static int futex_close(struct inode *inode, struct file *filp)
 
        unqueue_me(q);
        kfree(q);
+
        return 0;
 }
 
@@ -766,7 +782,7 @@ static struct file_operations futex_fops = {
  * Signal allows caller to avoid the race which would occur if they
  * set the sigio stuff up afterwards.
  */
-static int futex_fd(unsigned long uaddr, int signal)
+static int futex_fd(u32 __user *uaddr, int signal)
 {
        struct futex_q *q;
        struct file *filp;
@@ -937,7 +953,7 @@ retry:
                        goto retry;
 
                if (uval & FUTEX_WAITERS)
-                       futex_wake((unsigned long)uaddr, 1);
+                       futex_wake(uaddr, 1);
        }
        return 0;
 }
@@ -999,8 +1015,8 @@ void exit_robust_list(struct task_struct *curr)
        }
 }
 
-long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout,
-               unsigned long uaddr2, int val2, int val3)
+long do_futex(u32 __user *uaddr, int op, u32 val, unsigned long timeout,
+               u32 __user *uaddr2, u32 val2, u32 val3)
 {
        int ret;
 
@@ -1031,13 +1047,13 @@ long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout,
 }
 
 
-asmlinkage long sys_futex(u32 __user *uaddr, int op, int val,
+asmlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
                          struct timespec __user *utime, u32 __user *uaddr2,
-                         int val3)
+                         u32 val3)
 {
        struct timespec t;
        unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
-       int val2 = 0;
+       u32 val2 = 0;
 
        if (utime && (op == FUTEX_WAIT)) {
                if (copy_from_user(&t, utime, sizeof(t)) != 0)
@@ -1050,10 +1066,9 @@ asmlinkage long sys_futex(u32 __user *uaddr, int op, int val,
         * requeue parameter in 'utime' if op == FUTEX_REQUEUE.
         */
        if (op >= FUTEX_REQUEUE)
-               val2 = (int) (unsigned long) utime;
+               val2 = (u32) (unsigned long) utime;
 
-       return do_futex((unsigned long)uaddr, op, val, timeout,
-                       (unsigned long)uaddr2, val2, val3);
+       return do_futex(uaddr, op, val, timeout, uaddr2, val2, val3);
 }
 
 static int futexfs_get_sb(struct file_system_type *fs_type,
diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c
index 1ab6a0ea3d14..7e57c31670a3 100644
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -139,6 +139,5 @@ asmlinkage long compat_sys_futex(u32 __user *uaddr, int op, u32 val,
        if (op >= FUTEX_REQUEUE)
                val2 = (int) (unsigned long) utime;
 
-       return do_futex((unsigned long)uaddr, op, val, timeout,
-                       (unsigned long)uaddr2, val2, val3);
+       return do_futex(uaddr, op, val, timeout, uaddr2, val2, val3);
 }