aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/futex.c
Commit message (Collapse)AuthorAge
* futex: setup writeable mapping for futex ops which modify user space dataThomas Gleixner2009-05-19
| | | | | | | | | | | | | | | | | | | | | | | | The futex code installs a read only mapping via get_user_pages_fast() even if the futex op function has to modify user space data. The eventual fault was fixed up by futex_handle_fault() which walked the VMA with mmap_sem held. After the cleanup patches which removed the mmap_sem dependency of the futex code commit 4dc5b7a36a49eff97050894cf1b3a9a02523717 (futex: clean up fault logic) removed the private VMA walk logic from the futex code. This change results in a stale RO mapping which is not fixed up. Instead of reintroducing the previous fault logic we set up the mapping in get_user_pages_fast() read/write for all operations which modify user space data. Also handle private futexes in the same way and make the current unconditional access_ok(VERIFY_WRITE) depend on the futex op. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> CC: stable@kernel.org
* futex: comment requeue key reference semanticsDarren Hart2009-04-02
| | | | | | | | | | | | | We've tripped over the futex_requeue drop_count refering to key2 instead of key1. The code is actually correct, but is non-intuitive. This patch adds an explicit comment explaining the requeue. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: remove the pointer math from double_unlock_hb, fixIngo Molnar2009-03-13
| | | | | | | | | | | | | Impact: fix double unlock crash Thomas Gleixner noticed that the simplified double_unlock_hb() became ... too unsophisticated: in the hb1 == hb2 case it will do a double unlock. Reported-by: Thomas Gleixner <tglx@linutronix.de> Cc: Darren Hart <dvhltc@us.ibm.com> LKML-Reference: <20090312221118.11146.68610.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: remove the pointer math from double_unlock_hbDarren Hart2009-03-12
| | | | | | | | | | | | | | Impact: simplify code I mistakenly included the pointer value ordering in the double_unlock_hb() in my previous patch. It's only necessary in the double_lock_hb() function. This patch removes it. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <20090312221118.11146.68610.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: clean up fault logicDarren Hart2009-03-12
| | | | | | | | | | | | | | | | | | | | | Impact: cleanup Older versions of the futex code held the mmap_sem which had to be dropped in order to call get_user(), so a two-pronged fault handling mechanism was employed to handle faults of the atomic operations. The mmap_sem is no longer held, so get_user() should be adequate. This patch greatly simplifies the logic and improves legibility. Build and boot tested on a 4 way Intel x86_64 workstation. Passes basic pthread_mutex and PI tests out of ltp/testcases/realtime. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <20090312075612.9856.48612.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: unlock before returning -EFAULTDarren Hart2009-03-12
| | | | | | | | | | | | | | | | | | | | | Impact: rt-mutex failure case fix futex_lock_pi can potentially return -EFAULT with the rt_mutex held. This seems like the wrong thing to do as userspace should assume -EFAULT means the lock was not taken. Even if it could figure this out, we'd be leaving the pi_state->owner in an inconsistent state. This patch unlocks the rt_mutex prior to returning -EFAULT to userspace. Build and boot tested on a 4 way Intel x86_64 workstation. Passes basic pthread_mutex and PI tests out of ltp/testcases/realtime. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <20090312075606.9856.88729.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: use current->time_slack_ns for rt tasks tooDarren Hart2009-03-12
| | | | | | | | | | | | | | | | | RT tasks should set their timer slack to 0 on their own. This patch removes the 'if (rt_task()) slack = 0;' block in futex_wait. Build and boot tested on a 4 way Intel x86_64 workstation. Passes basic pthread_mutex and PI tests out of ltp/testcases/realtime. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <20090312075559.9856.28822.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: add double_unlock_hb()Darren Hart2009-03-12
| | | | | | | | | | | | | | | | | | | | | Impact: cleanup The futex code uses double_lock_hb() which locks the hb->lock's in pointer value order. There is no parallel unlock routine, and the code unlocks them in name order, ignoring pointer value. This patch adds double_unlock_hb() to refactor the duplicated code segments. Build and boot tested on a 4 way Intel x86_64 workstation. Passes basic pthread_mutex and PI tests out of ltp/testcases/realtime. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <20090312075552.9856.48021.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: additional (get|put)_futex_key() fixesDarren Hart2009-03-12
| | | | | | | | | | | | | | | | | | | Impact: fix races futex_requeue and futex_lock_pi still had some bad (get|put)_futex_key() usage. This patch adds the missing put_futex_keys() and corrects a goto in futex_lock_pi() to avoid a double get. Build and boot tested on a 4 way Intel x86_64 workstation. Passes basic pthread_mutex and PI tests out of ltp/testcases/realtime. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <20090312075545.9856.75152.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: update futex commentaryDarren Hart2009-03-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: cleanup The futex_hash_bucket can be a bit confusing when first looking at the code as it is a shared queue (and futex_q isn't a queue at all, but rather an element on the queue). The mmap_sem is no longer held outside of the futex_handle_fault() routine, yet numerous comments refer to it. The fshared argument is no an integer. I left some of these comments along as they are simply removed in future patches. Some of the commentary refering to futexes by virtual page mappings was not very clear, and completely accurate (as for shared futexes both the page and the offset are used to determine the key). For the purposes of the function description, just referring to "the futex" seems sufficient. With hashed futexes we now access the page after the hash-bucket is locked, and not only after it is enqueued. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <20090312075537.9856.29954.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: fix reference leakPeter Zijlstra2009-02-11
| | | | | | | | | | | | | | | | Catalin noticed that (38d47c1b7075: futex: rely on get_user_pages() for shared futexes) caused an mm_struct leak. Some tracing with the function graph tracer quickly pointed out that futex_wait() has exit paths with unbalanced reference counts. This regression was discovered by kmemleak. Reported-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com> Tested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* [CVE-2009-0029] System call wrappers part 31Heiko Carstens2009-01-14
| | | | Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
* [CVE-2009-0029] System call wrappers part 08Heiko Carstens2009-01-14
| | | | Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
*-. Merge branches 'core/futexes', 'core/locking', 'core/rcu' and 'linus' into ↵Ingo Molnar2009-01-06
|\ \ | | | | | | | | | core/urgent
| * | futex: catch certain assymetric (get|put)_futex_key callsDarren Hart2009-01-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: add debug check Following up on my previous key reference accounting patches, this patch will catch puts on keys that haven't been "got". This won't catch nested get/put mismatches though. Build and boot tested, with minimal desktop activity and a run of the open_posix_testsuite in LTP for testing. No warnings logged. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | futex: make futex_(get|put)_key() calls symmetricDarren Hart2008-12-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: cleanup This patch makes the calls to futex_get_key_refs() and futex_drop_key_refs() explicitly symmetric by only "putting" keys we successfully "got". Also cleanup a couple return points that didn't "put" after a successful "get". Build and boot tested on an x86_64 system. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | Merge branch 'core-for-linus' of ↵Linus Torvalds2008-12-30
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (63 commits) stacktrace: provide save_stack_trace_tsk() weak alias rcu: provide RCU options on non-preempt architectures too printk: fix discarding message when recursion_bug futex: clean up futex_(un)lock_pi fault handling "Tree RCU": scalable classic RCU implementation futex: rename field in futex_q to clarify single waiter semantics x86/swiotlb: add default swiotlb_arch_range_needs_mapping x86/swiotlb: add default phys<->bus conversion x86: unify pci iommu setup and allow swiotlb to compile for 32 bit x86: add swiotlb allocation functions swiotlb: consolidate swiotlb info message printing swiotlb: support bouncing of HighMem pages swiotlb: factor out copy to/from device swiotlb: add arch hook to force mapping swiotlb: allow architectures to override phys<->bus<->phys conversions swiotlb: add comment where we handle the overflow of a dma mask on 32 bit rcu: fix rcutorture behavior during reboot resources: skip sanity check of busy resources swiotlb: move some definitions to header swiotlb: allow architectures to override swiotlb pool allocation ... Fix up trivial conflicts in arch/x86/kernel/Makefile arch/x86/mm/init_32.c include/linux/hardirq.h as per Ingo's suggestions.
| * | futex: clean up futex_(un)lock_pi fault handlingDarren Hart2008-12-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: cleanup Some apparently left over cruft code was complicating the fault logic: Testing if uval != -EFAULT doesn't have any meaning, get_user() sets ret to either 0 or -EFAULT, there's no need to compare uval, especially not against EFAULT which it will never be. This patch removes the superfluous test and clarifies the comment blocks. Build and boot tested on an 8way x86_64 system. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | futex: rename field in futex_q to clarify single waiter semanticsDarren Hart2008-12-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: simplify code I've tripped over the naming of this field a couple times. The futex_q uses a "waiters" list to represent a single blocked task and then calles wake_up_all(). This can lead to confusion in trying to understand the intent of the code, which is to have a single futex_q for every task waiting on a futex. This patch corrects the problem, using a single pointer to the waiting task, and an appropriate call to wake_up, rather than wake_up_all. Compile and boot tested on an 8way x86_64 machine. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | futex: make clock selectable for FUTEX_WAIT_BITSETThomas Gleixner2008-11-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | FUTEX_WAIT_BITSET could be used instead of FUTEX_WAIT by setting the bit set to FUTEX_BITSET_MATCH_ANY, but FUTEX_WAIT uses CLOCK_REALTIME while FUTEX_WAIT_BITSET uses CLOCK_MONOTONIC. Add a flag to select CLOCK_REALTIME for FUTEX_WAIT_BITSET so glibc can replace the FUTEX_WAIT logic which needs to do gettimeofday() calls before and after the syscall to convert the absolute timeout to a relative timeout for FUTEX_WAIT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ulrich Drepper <drepper@redhat.com>
| * | Merge branch 'linus' into core/futexesThomas Gleixner2008-11-24
| |\|
| * | futex: fixup get_futex_key() for private futexesPeter Zijlstra2008-09-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | With the get_user_pages_fast() patches we made get_futex_key() obtain a reference on the returned key, but failed to do so for private futexes. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | futex: cleanup fsharedPeter Zijlstra2008-09-30
| | | | | | | | | | | | | | | | | | | | | | | | fshared doesn't need to be a rw_sem pointer anymore, so clean that up. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | futex: use fast_gup()Peter Zijlstra2008-09-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | Change the get_user_pages() call with fast_gup() which doesn't require holding the mmap_sem thereby removing the mmap_sem from all fast paths. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | futex: reduce mmap_sem usagePeter Zijlstra2008-09-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | now that we rely on get_user_pages() for the shared key handling move all the mmap_sem stuff closely around the slow paths. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | futex: rely on get_user_pages() for shared futexesPeter Zijlstra2008-09-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | On the way of getting rid of the mmap_sem requirement for shared futexes, start by relying on get_user_pages(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | CRED: Use RCU to access another task's creds and to release a task's own credsDavid Howells2008-11-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use RCU to access another task's creds and to release a task's own creds. This means that it will be possible for the credentials of a task to be replaced without another task (a) requiring a full lock to read them, and (b) seeing deallocated memory. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
* | | CRED: Separate task security context from task_structDavid Howells2008-11-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Separate the task security context from task_struct. At this point, the security data is temporarily embedded in the task_struct with two pointers pointing to it. Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in entry.S via asm-offsets. With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
* | | CRED: Wrap task credential accesses in the core kernelDavid Howells2008-11-13
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Wrap access to task credentials so that they can be separated more easily from the task_struct during the introduction of COW creds. Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id(). Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more sense to use RCU directly rather than a convenient wrapper; these will be addressed by later patches. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-audit@redhat.com Cc: containers@lists.linux-foundation.org Cc: linux-mm@kvack.org Signed-off-by: James Morris <jmorris@namei.org>
* | hrtimer: make the futex() system call use the per process slack valueArjan van de Ven2008-09-11
| | | | | | | | | | | | | | | | This patch makes the futex() system call use the per process slack value; with this users are able to externally control existing applications to reduce the wakeup rate. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
* | hrtimer: convert kernel/* to the new hrtimer apisArjan van de Ven2008-09-06
|/ | | | | | | | In order to be able to do range hrtimers we need to use accessor functions to the "expire" member of the hrtimer struct. This patch converts kernel/* to these accessors. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
* futexes: fix fault handling in futex_lock_piThomas Gleixner2008-06-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch addresses a very sporadic pi-futex related failure in highly threaded java apps on large SMP systems. David Holmes reported that the pi_state consistency check in lookup_pi_state triggered with his test application. This means that the kernel internal pi_state and the user space futex variable are out of sync. First we assumed that this is a user space data corruption, but deeper investigation revieled that the problem happend because the pi-futex code is not handling a fault in the futex_lock_pi path when the user space variable needs to be fixed up. The fault happens when a fork mapped the anon memory which contains the futex readonly for COW or the page got swapped out exactly between the unlock of the futex and the return of either the new futex owner or the task which was the expected owner but failed to acquire the kernel internal rtmutex. The current futex_lock_pi() code drops out with an inconsistent in case it faults and returns -EFAULT to user space. User space has no way to fixup that state. When we wrote this code we thought that we could not drop the hash bucket lock at this point to handle the fault. After analysing the code again it turned out to be wrong because there are only two tasks involved which might modify the pi_state and the user space variable: - the task which acquired the rtmutex - the pending owner of the pi_state which did not get the rtmutex Both tasks drop into the fixup_pi_state() function before returning to user space. The first task which acquired the hash bucket lock faults in the fixup of the user space variable, drops the spinlock and calls futex_handle_fault() to fault in the page. Now the second task could acquire the hash bucket lock and tries to fixup the user space variable as well. It either faults as well or it succeeds because the first task already faulted the page in. One caveat is to avoid a double fixup. After returning from the fault handling we reacquire the hash bucket lock and check whether the pi_state owner has been modified already. Reported-by: David Holmes <david.holmes@sun.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Holmes <david.holmes@sun.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> kernel/futex.c | 93 ++++++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 73 insertions(+), 20 deletions(-)
* Removal of FUTEX_FDEric Sesterhenn2008-05-05
| | | | | | | | | | | | | | | Since FUTEX_FD was scheduled for removal in June 2007 lets remove it. Google Code search found no users for it and NGPT was abandoned in 2003 according to IBM. futex.h is left untouched to make sure the id does not get reassigned. Since queue_me() has no users left it is commented out to avoid a warning, i didnt remove it completely since it is part of the internal api (matching unqueue_me()) Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (removed rest) Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* add hrtimer specific debugobjects codeThomas Gleixner2008-04-30
| | | | | | | | | | | | | | | | | hrtimers have now dynamic users in the network code. Put them under debugobjects surveillance as well. Add calls to the generic object debugging infrastructure and provide fixup functions which allow to keep the system alive when recoverable problems have been detected by the object debugging core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* NULL noise: fs/*, mm/*, kernel/*Al Viro2008-03-30
| | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Give futex init a proper nameBenjamin Herrenschmidt2008-03-27
| | | | | | | | | | The futex init function is called init(). This is a pain in the neck when debugging when you code dies in ... init :-) This renames it to futex_init(). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* futex: runtime enable pi and robust functionalityThomas Gleixner2008-02-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Not all architectures implement futex_atomic_cmpxchg_inatomic(). The default implementation returns -ENOSYS, which is currently not handled inside of the futex guts. Futex PI calls and robust list exits with a held futex result in an endless loop in the futex code on architectures which have no support. Fixing up every place where futex_atomic_cmpxchg_inatomic() is called would add a fair amount of extra if/else constructs to the already complex code. It is also not possible to disable the robust feature before user space tries to register robust lists. Compile time disabling is not a good idea either, as there are already architectures with runtime detection of futex_atomic_cmpxchg_inatomic support. Detect the functionality at runtime instead by calling cmpxchg_futex_value_locked() with a NULL pointer from the futex initialization code. This is guaranteed to fail, but the call of futex_atomic_cmpxchg_inatomic() happens with pagefaults disabled. On architectures, which use the asm-generic implementation or have a runtime CPU feature detection, a -ENOSYS return value disables the PI/robust features. On architectures with a working implementation the call returns -EFAULT and the PI/robust features are enabled. The relevant syscalls return -ENOSYS and the robust list exit code is blocked, when the detection fails. Fixes http://lkml.org/lkml/2008/2/11/149 Originally reported by: Lennart Buytenhek Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Lennert Buytenhek <buytenh@wantstofly.org> Cc: Riku Voipio <riku.voipio@movial.fi> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* futex: fix init orderThomas Gleixner2008-02-23
| | | | | | | | | | | | | | | | | When the futex init code fails to initialize the futex pseudo file system it returns early without initializing the hash queues. Should the boot succeed then a futex syscall which tries to enqueue a waiter on the hashqueue will crash due to the unitilialized plist heads. Initialize the hash queues before the filesystem. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Lennert Buytenhek <buytenh@wantstofly.org> Cc: Riku Voipio <riku.voipio@movial.fi> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hrtimer: check relative timeouts for overflowThomas Gleixner2008-02-14
| | | | | | | | | | | | | | | | | Various user space callers ask for relative timeouts. While we fixed that overflow issue in hrtimer_start(), the sites which convert relative user space values to absolute timeouts themself were uncovered. Instead of putting overflow checks into each place add a function which does the sanity checking and convert all affected callers to use it. Thanks to Frans Pop, who reported the problem and tested the fixes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Tested-by: Frans Pop <elendil@planet.nl>
* futex: Add bitset conditional wait/wakeup functionalityThomas Gleixner2008-02-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To allow the implementation of optimized rw-locks in user space, glibc needs a possibility to select waiters for wakeup depending on a bitset mask. This requires two new futex OPs: FUTEX_WAIT_BITS and FUTEX_WAKE_BITS These OPs are basically the same as FUTEX_WAIT and FUTEX_WAKE plus an additional argument - a bitset. Further the FUTEX_WAIT_BITS OP is expecting an absolute timeout value instead of the relative one, which is used for the FUTEX_WAIT OP. FUTEX_WAIT_BITS calls into the kernel with a bitset. The bitset is stored in the futex_q structure, which is used to enqueue the waiter into the hashed futex waitqueue. FUTEX_WAKE_BITS also calls into the kernel with a bitset. The wakeup function logically ANDs the bitset with the bitset stored in each waiters futex_q structure. If the result is zero (i.e. none of the set bits in the bitsets is matching), then the waiter is not woken up. If the result is not zero (i.e. one of the set bits in the bitsets is matching), then the waiter is woken. The bitset provided by the caller must be non zero. In case the provided bitset is zero the kernel returns EINVAL. Internaly the new OPs are only extensions to the existing FUTEX_WAIT and FUTEX_WAKE functions. The existing OPs hand a bitset with all bits set into the futex_wait() and futex_wake() functions. Signed-off-by: Thomas Gleixner <tgxl@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: Remove warn on in return fixup pathThomas Gleixner2008-02-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The WARN_ON() in the fixup return path of futex_lock_pi() can trigger with false positives. The following scenario happens: t1 holds the futex and t2 and t3 are blocked on the kernel side rt_mutex. t1 releases the futex (and the rt_mutex) and assigned t2 to be the next owner of the futex. t2 is interrupted and returns w/o acquiring the rt_mutex, before t1 can release the rtmutex. t1 releases the rtmutex and t3 becomes the pending owner of the rtmutex. t2 notices that it is the designated owner (user space variable) and fails to acquire the rt_mutex via trylock, because it is not allowed to steal the rt_mutex from t3. Now it looks at the rt_mutex pending owner (t3) and assigns the futex and the pi_state to it. During the fixup t4 steals the rtmutex from t3. t2 returns from the fixup and the owner of the rt_mutex has changed from t3 to t4. There is no need to do another round of fixups from t2. The important part (t2 is not returning as the user space visible owner) is done. The further fixups are done, before either t3 or t4 return to user space. For the user space it is not relevant which task (t3 or t4) is the real owner, as long as those are both in the kernel, which is guaranteed by the serialization of the hash bucket lock. Both tasks (which ever returns first to userspace - t4 because it locked the rt_mutex or t3 due to a signal) are going through the lock_futex_pi() return path where the ownership is fixed before the return to user space. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* hrtimer: fix hrtimer_init_sleeper() usersPeter Zijlstra2008-02-01
| | | | | | | | | | | | | | | | | | this patch: commit 37bb6cb4097e29ffee970065b74499cbf10603a3 Author: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Fri Jan 25 21:08:32 2008 +0100 hrtimer: unlock hrtimer_wakeup Broke hrtimer_init_sleeper() users. It forgot to fix up the futex caller of this function to detect the failed queueing and messed up the do_nanosleep() caller in that it could leak a TASK_INTERRUPTIBLE state. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: Prevent stale futex owner when interrupted/timeoutThomas Gleixner2008-01-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Roland Westrelin did a great analysis of a long standing thinko in the return path of futex_lock_pi. While we fixed the lock steal case long ago, which was easy to trigger, we never had a test case which exposed this problem and stupidly never thought about the reverse lock stealing scenario and the return to user space with a stale state. When a blocked tasks returns from rt_mutex_timed_locked without holding the rt_mutex (due to a signal or timeout) and at the same time the task holding the futex is releasing the futex and assigning the ownership of the futex to the returning task, then it might happen that a third task acquires the rt_mutex before the final rt_mutex_trylock() of the returning task happens under the futex hash bucket lock. The returning task returns to user space with ETIMEOUT or EINTR, but the user space futex value is assigned to this task. The task which acquired the rt_mutex fixes the user space futex value right after the hash bucket lock has been released by the returning task, but for a short period of time the user space value is wrong. Detailed description is available at: https://bugzilla.redhat.com/show_bug.cgi?id=400541 The fix for this is the same as we do when the rt_mutex was acquired by a higher priority task via lock stealing from the designated new owner. In that case we already fix the user space value and the internal pi_state up before we return. This mechanism can be used to fixup the above corner case as well. When the returning task, which failed to acquire the rt_mutex, notices that it is the designated owner of the futex, then it fixes up the stale user space value and the pi_state, before returning to user space. This happens with the futex hash bucket lock held, so the task which acquired the rt_mutex is guaranteed to be blocked on the hash bucket lock. We can access the rt_mutex owner, which gives us the pid of the new owner, safely here as the owner is not able to modify (release) it while waiting on the hash bucket lock. Rename the "curr" argument of fixup_pi_state_owner() to "newowner" to avoid confusion with current and add the check for the stale state into the failure path of rt_mutex_trylock() in the return path of unlock_futex_pi(). If the situation is detected use fixup_pi_state_owner() to assign everything to the owner of the rt_mutex. Pointed-out-and-tested-by: Roland Westrelin <roland.westrelin@sun.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* futex: correctly return -EFAULT not -EINVALThomas Gleixner2007-12-05
| | | | | | | return -EFAULT not -EINVAL. Found by review. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: fix for futex_wait signal stack corruptionSteven Rostedt2007-12-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | David Holmes found a bug in the -rt tree with respect to pthread_cond_timedwait. After trying his test program on the latest git from mainline, I found the bug was there too. The bug he was seeing that his test program showed, was that if one were to do a "Ctrl-Z" on a process that was in the pthread_cond_timedwait, and then did a "bg" on that process, it would return with a "-ETIMEDOUT" but early. That is, the timer would go off early. Looking into this, I found the source of the problem. And it is a rather nasty bug at that. Here's the relevant code from kernel/futex.c: (not in order in the file) [...] smlinkage long sys_futex(u32 __user *uaddr, int op, u32 val, struct timespec __user *utime, u32 __user *uaddr2, u32 val3) { struct timespec ts; ktime_t t, *tp = NULL; u32 val2 = 0; int cmd = op & FUTEX_CMD_MASK; if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) { if (copy_from_user(&ts, utime, sizeof(ts)) != 0) return -EFAULT; if (!timespec_valid(&ts)) return -EINVAL; t = timespec_to_ktime(ts); if (cmd == FUTEX_WAIT) t = ktime_add(ktime_get(), t); tp = &t; } [...] return do_futex(uaddr, op, val, tp, uaddr2, val2, val3); } [...] long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3) { int ret; int cmd = op & FUTEX_CMD_MASK; struct rw_semaphore *fshared = NULL; if (!(op & FUTEX_PRIVATE_FLAG)) fshared = &current->mm->mmap_sem; switch (cmd) { case FUTEX_WAIT: ret = futex_wait(uaddr, fshared, val, timeout); [...] static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared, u32 val, ktime_t *abs_time) { [...] struct restart_block *restart; restart = &current_thread_info()->restart_block; restart->fn = futex_wait_restart; restart->arg0 = (unsigned long)uaddr; restart->arg1 = (unsigned long)val; restart->arg2 = (unsigned long)abs_time; restart->arg3 = 0; if (fshared) restart->arg3 |= ARG3_SHARED; return -ERESTART_RESTARTBLOCK; [...] static long futex_wait_restart(struct restart_block *restart) { u32 __user *uaddr = (u32 __user *)restart->arg0; u32 val = (u32)restart->arg1; ktime_t *abs_time = (ktime_t *)restart->arg2; struct rw_semaphore *fshared = NULL; restart->fn = do_no_restart_syscall; if (restart->arg3 & ARG3_SHARED) fshared = &current->mm->mmap_sem; return (long)futex_wait(uaddr, fshared, val, abs_time); } So when the futex_wait is interrupt by a signal we break out of the hrtimer code and set up or return from signal. This code does not return back to userspace, so we set up a RESTARTBLOCK. The bug here is that we save the "abs_time" which is a pointer to the stack variable "ktime_t t" from sys_futex. This returns and unwinds the stack before we get to call our signal. On return from the signal we go to futex_wait_restart, where we update all the parameters for futex_wait and call it. But here we have a problem where abs_time is no longer valid. I verified this with print statements, and sure enough, what abs_time was set to ends up being garbage when we get to futex_wait_restart. The solution I did to solve this (with input from Linus Torvalds) was to add unions to the restart_block to allow system calls to use the restart with specific parameters. This way the futex code now saves the time in a 64bit value in the restart block instead of storing it on the stack. Note: I'm a bit nervious to add "linux/types.h" and use u32 and u64 in thread_info.h, when there's a #ifdef __KERNEL__ just below that. Not sure what that is there for. If this turns out to be a problem, I've tested this with using "unsigned int" for u32 and "unsigned long long" for u64 and it worked just the same. I'm using u32 and u64 just to be consistent with what the futex code uses. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
* kernel/futex.c: make 3 functions staticAdrian Bunk2007-11-05
| | | | | | | | | | The following functions can now become static again: - get_futex_key() - get_futex_key_refs() - drop_futex_key_refs() Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* Uninline find_task_by_xxx set of functionsPavel Emelyanov2007-10-19
| | | | | | | | | | | | | | | | | | | | | | | | The find_task_by_something is a set of macros are used to find task by pid depending on what kind of pid is proposed - global or virtual one. All of them are wrappers above the most generic one - find_task_by_pid_type_ns() - and just substitute some args for it. It turned out, that dereferencing the current->nsproxy->pid_ns construction and pushing one more argument on the stack inline cause kernel text size to grow. This patch moves all this stuff out-of-line into kernel/pid.c. Together with the next patch it saves a bit less than 400 bytes from the .text section. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pid namespaces: changes to show virtual ids to userPavel Emelyanov2007-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the largest patch in the set. Make all (I hope) the places where the pid is shown to or get from user operate on the virtual pids. The idea is: - all in-kernel data structures must store either struct pid itself or the pid's global nr, obtained with pid_nr() call; - when seeking the task from kernel code with the stored id one should use find_task_by_pid() call that works with global pids; - when showing pid's numerical value to the user the virtual one should be used, but however when one shows task's pid outside this task's namespace the global one is to be used; - when getting the pid from userspace one need to consider this as the virtual one and use appropriate task/pid-searching functions. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: nuther build fix] [akpm@linux-foundation.org: yet nuther build fix] [akpm@linux-foundation.org: remove unneeded casts] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sparse pointer use of zero as nullStephen Hemminger2007-10-18
| | | | | | | | | | | | | | | | | Get rid of sparse related warnings from places that use integer as NULL pointer. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Cc: Andi Kleen <ak@suse.de> Cc: Jeff Garzik <jeff@garzik.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Ian Kent <raven@themaw.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* change inotifyfs magic as the same magic is used for futexfsAndrey Mirkin2007-10-17
| | | | | | | | | | | Right now futexfs and inotifyfs have one magic 0xBAD1DEA, that looks a little bit confusing. Use 0xBAD1DEA as magic for futexfs and 0x2BAD1DEA as magic for inotifyfs. Signed-off-by: Andrey Mirkin <major@openvz.org> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>