author | Manfred Spraul <manfred@colorfullife.com> | 2010-05-26 17:43:43 -0400 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2010-05-27 12:12:49 -0400 |
commit | c5cf6359ad1d322c16e159011247341849cc0d3a (patch) | |
tree | aefc0ff518c05d5fb386ab2103ec4dc25bffbe4d /ipc | |
parent | 31a7c4746e9925512afab30557dd445d677cc802 (diff) |
ipc/sem.c: update description of the implementation
ipc/sem.c begins with a 15-year-old description of bugs in the initial
implementation in Linux-1.0. The patch replaces that with a top-level
description of the current code.
A TODO could be derived from this text:
The Open Group man page for semop() does not mandate FIFO ordering. Thus
there is no need for a per-semaphore-array list of pending operations.
If
- this list is removed
- the per-semaphore array spinlock is removed (possible if there is no
list to protect)
- sem_otime is moved into the semaphores and calculated on demand during
semctl()
then the array would be read-mostly - which would significantly improve
scaling for applications that use semaphore arrays with lots of entries.
The price would be expensive semctl() calls:
	for (i = 0; i < sma->sem_nsems; i++) spin_lock(sma->sem_lock);
	<do stuff>
	for (i = 0; i < sma->sem_nsems; i++) spin_unlock(sma->sem_lock);
I'm not sure if the complexity is worth the effort, thus here is the
documentation of the current behavior first.
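The "lock everything for semctl()" idea from the TODO above can be sketched in user space as follows. This is only an illustration under assumptions: it uses C11 atomic_flag as a stand-in for a kernel spinlock, it gives each semaphore its own lock (which is what the TODO implies), and every demo_* name is hypothetical and does not exist in ipc/sem.c.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel's data structures. */
struct demo_sem {
	atomic_flag lock;	/* per-semaphore spinlock (assumed design) */
	int semval;		/* current semaphore value */
};

struct demo_sem_array {
	size_t nsems;		/* number of semaphores in the array */
	struct demo_sem *sems;
};

static void demo_spin_lock(atomic_flag *l)
{
	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;
}

static void demo_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Take every per-semaphore lock, always in ascending index order. */
static void demo_lock_all(struct demo_sem_array *sma)
{
	for (size_t i = 0; i < sma->nsems; i++)
		demo_spin_lock(&sma->sems[i].lock);
}

static void demo_unlock_all(struct demo_sem_array *sma)
{
	for (size_t i = 0; i < sma->nsems; i++)
		demo_spin_unlock(&sma->sems[i].lock);
}

/* A GETALL-like snapshot: consistent because all locks are held. */
static void demo_getall(struct demo_sem_array *sma, int *out)
{
	demo_lock_all(sma);
	for (size_t i = 0; i < sma->nsems; i++)
		out[i] = sma->sems[i].semval;
	demo_unlock_all(sma);
}
```

Taking the locks in one fixed order keeps two concurrent whole-array callers from deadlocking against each other. The cost is the one named above: each such call touches nsems locks instead of one.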
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'ipc')
-rw-r--r-- | ipc/sem.c | 103 |
1 files changed, 53 insertions, 50 deletions
@@ -3,56 +3,6 @@ | |||
3 | * Copyright (C) 1992 Krishna Balasubramanian | 3 | * Copyright (C) 1992 Krishna Balasubramanian |
4 | * Copyright (C) 1995 Eric Schenk, Bruno Haible | 4 | * Copyright (C) 1995 Eric Schenk, Bruno Haible |
5 | * | 5 | * |
6 | * IMPLEMENTATION NOTES ON CODE REWRITE (Eric Schenk, January 1995): | ||
7 | * This code underwent a massive rewrite in order to solve some problems | ||
8 | * with the original code. In particular the original code failed to | ||
9 | * wake up processes that were waiting for semval to go to 0 if the | ||
10 | * value went to 0 and was then incremented rapidly enough. In solving | ||
11 | * this problem I have also modified the implementation so that it | ||
12 | * processes pending operations in a FIFO manner, thus give a guarantee | ||
13 | * that processes waiting for a lock on the semaphore won't starve | ||
14 | * unless another locking process fails to unlock. | ||
15 | * In addition the following two changes in behavior have been introduced: | ||
16 | * - The original implementation of semop returned the value | ||
17 | * last semaphore element examined on success. This does not | ||
18 | * match the manual page specifications, and effectively | ||
19 | * allows the user to read the semaphore even if they do not | ||
20 | * have read permissions. The implementation now returns 0 | ||
21 | * on success as stated in the manual page. | ||
22 | * - There is some confusion over whether the set of undo adjustments | ||
23 | * to be performed at exit should be done in an atomic manner. | ||
24 | * That is, if we are attempting to decrement the semval should we queue | ||
25 | * up and wait until we can do so legally? | ||
26 | * The original implementation attempted to do this. | ||
27 | * The current implementation does not do so. This is because I don't | ||
28 | * think it is the right thing (TM) to do, and because I couldn't | ||
29 | * see a clean way to get the old behavior with the new design. | ||
30 | * The POSIX standard and SVID should be consulted to determine | ||
31 | * what behavior is mandated. | ||
32 | * | ||
33 | * Further notes on refinement (Christoph Rohland, December 1998): | ||
34 | * - The POSIX standard says, that the undo adjustments simply should | ||
35 | * redo. So the current implementation is o.K. | ||
36 | * - The previous code had two flaws: | ||
37 | * 1) It actively gave the semaphore to the next waiting process | ||
38 | * sleeping on the semaphore. Since this process did not have the | ||
39 | * cpu this led to many unnecessary context switches and bad | ||
40 | * performance. Now we only check which process should be able to | ||
41 | * get the semaphore and if this process wants to reduce some | ||
42 | * semaphore value we simply wake it up without doing the | ||
43 | * operation. So it has to try to get it later. Thus e.g. the | ||
44 | * running process may reacquire the semaphore during the current | ||
45 | * time slice. If it only waits for zero or increases the semaphore, | ||
46 | * we do the operation in advance and wake it up. | ||
47 | * 2) It did not wake up all zero waiting processes. We try to do | ||
48 | * better but only get the semops right which only wait for zero or | ||
49 | * increase. If there are decrement operations in the operations | ||
50 | * array we do the same as before. | ||
51 | * | ||
52 | * With the incarnation of O(1) scheduler, it becomes unnecessary to perform | ||
53 | * check/retry algorithm for waking up blocked processes as the new scheduler | ||
54 | * is better at handling thread switch than the old one. | ||
55 | * | ||
56 | * /proc/sysvipc/sem support (c) 1999 Dragos Acostachioaie <dragos@iname.com> | 6 | * /proc/sysvipc/sem support (c) 1999 Dragos Acostachioaie <dragos@iname.com> |
57 | * | 7 | * |
58 | * SMP-threaded, sysctl's added | 8 | * SMP-threaded, sysctl's added |
@@ -61,6 +11,8 @@ | |||
61 | * (c) 2001 Red Hat Inc | 11 | * (c) 2001 Red Hat Inc |
62 | * Lockless wakeup | 12 | * Lockless wakeup |
63 | * (c) 2003 Manfred Spraul <manfred@colorfullife.com> | 13 | * (c) 2003 Manfred Spraul <manfred@colorfullife.com> |
14 | * Further wakeup optimizations, documentation | ||
15 | * (c) 2010 Manfred Spraul <manfred@colorfullife.com> | ||
64 | * | 16 | * |
65 | * support for audit of ipc object properties and permission changes | 17 | * support for audit of ipc object properties and permission changes |
66 | * Dustin Kirkland <dustin.kirkland@us.ibm.com> | 18 | * Dustin Kirkland <dustin.kirkland@us.ibm.com> |
@@ -68,6 +20,57 @@ | |||
68 | * namespaces support | 20 | * namespaces support |
69 | * OpenVZ, SWsoft Inc. | 21 | * OpenVZ, SWsoft Inc. |
70 | * Pavel Emelianov <xemul@openvz.org> | 22 | * Pavel Emelianov <xemul@openvz.org> |
23 | * | ||
24 | * Implementation notes: (May 2010) | ||
25 | * This file implements System V semaphores. | ||
26 | * | ||
27 | * User space visible behavior: | ||
28 | * - FIFO ordering for semop() operations (just FIFO, not starvation | ||
29 | * protection) | ||
30 | * - multiple semaphore operations that alter the same semaphore in | ||
31 | * one semop() are handled. | ||
32 | * - sem_ctime (time of last semctl()) is updated in the IPC_SET, SETVAL and | ||
33 | * SETALL calls. | ||
34 | * - two Linux specific semctl() commands: SEM_STAT, SEM_INFO. | ||
35 | * - undo adjustments at process exit are limited to 0..SEMVMX. | ||
36 | * - namespaces are supported. | ||
37 | * - SEMMSL, SEMMNS, SEMOPM and SEMMNI can be configured at runtime by writing | ||
38 | * to /proc/sys/kernel/sem. | ||
39 | * - statistics about the usage are reported in /proc/sysvipc/sem. | ||
40 | * | ||
41 | * Internals: | ||
42 | * - scalability: | ||
43 | * - all global variables are read-mostly. | ||
44 | * - semop() calls and semctl(RMID) are synchronized by RCU. | ||
45 | * - most operations do write operations (actually: spin_lock calls) to | ||
46 | * the per-semaphore array structure. | ||
47 | * Thus: Perfect SMP scaling between independent semaphore arrays. | ||
48 | * If multiple semaphores in one array are used, then cache line | ||
49 | * thrashing on the semaphore array spinlock will limit the scaling. | ||
50 | * - semncnt and semzcnt are calculated on demand in count_semncnt() and | ||
51 | * count_semzcnt() | ||
52 | * - the task that performs a successful semop() scans the list of all | ||
53 | * sleeping tasks and completes any pending operations that can be fulfilled. | ||
54 | * Semaphores are actively given to waiting tasks (necessary for FIFO). | ||
55 | * (see update_queue()) | ||
56 | * - To improve the scalability, the actual wake-up calls are performed after | ||
57 | * dropping all locks. (see wake_up_sem_queue_prepare(), | ||
58 | * wake_up_sem_queue_do()) | ||
59 | * - All work is done by the waker, the woken up task does not have to do | ||
60 | * anything - not even acquiring a lock or dropping a refcount. | ||
61 | * - A woken up task may not even touch the semaphore array anymore, it may | ||
62 | * have been destroyed already by a semctl(RMID). | ||
63 | * - The synchronization between wake-ups due to a timeout/signal and a | ||
64 | * wake-up due to a completed semaphore operation is achieved by using an | ||
65 | * intermediate state (IN_WAKEUP). | ||
66 | * - UNDO values are stored in an array (one per process and per | ||
67 | * semaphore array, lazily allocated). For backwards compatibility, multiple | ||
68 | * modes for the UNDO variables are supported (per process, per thread) | ||
69 | * (see copy_semundo, CLONE_SYSVSEM) | ||
70 | * - There are two lists of the pending operations: a per-array list | ||
71 | * and per-semaphore list (stored in the array). This allows to achieve FIFO | ||
72 | * ordering without always scanning all pending operations. | ||
73 | * The worst-case behavior is nevertheless O(N^2) for N wakeups. | ||
71 | */ | 74 | */ |
72 | 75 | ||
73 | #include <linux/slab.h> | 76 | #include <linux/slab.h> |
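
The new comment states that undo adjustments at process exit are limited to 0..SEMVMX. A minimal sketch of that clamping, written as plain user-space C, might look like the following. The helper name demo_apply_undo and the DEMO_SEMVMX macro are assumptions made for illustration; the value 32767 mirrors Linux's SEMVMX, and the kernel's actual exit path is not reproduced here.

```c
/* Local copy of the limit for this sketch; Linux defines SEMVMX as 32767. */
#define DEMO_SEMVMX 32767

/*
 * Apply one undo adjustment to a semaphore value and clamp the result
 * into the range 0..DEMO_SEMVMX instead of blocking or failing.
 */
static int demo_apply_undo(int semval, int adjust)
{
	long v = (long)semval + adjust;	/* widen to avoid int overflow */

	if (v < 0)
		v = 0;
	if (v > DEMO_SEMVMX)
		v = DEMO_SEMVMX;
	return (int)v;
}
```

For example, undoing a net +5 (an adjustment of -5) when another process has already lowered the value to 3 yields 0 rather than a negative value or a wait.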