author | Davidlohr Bueso <davidlohr@hp.com> | 2014-07-30 16:41:55 -0400 |
---|---|---|
committer | Ingo Molnar <mingo@kernel.org> | 2014-08-13 04:32:03 -0400 |
commit | 214e0aed639ef40987bf6159fad303171a6de31e (patch) | |
tree | 9f4c2eb1497a7377de93d619c05cf6c82fcfa0cb /Documentation/locking | |
parent | 7608a43d8f2e02f8b532f8e11481d7ecf8b5d3f9 (diff) |
locking/Documentation: Move locking related docs into Documentation/locking/
Specifically:
Documentation/locking/lockdep-design.txt
Documentation/locking/lockstat.txt
Documentation/locking/mutex-design.txt
Documentation/locking/rt-mutex-design.txt
Documentation/locking/rt-mutex.txt
Documentation/locking/spinlocks.txt
Documentation/locking/ww-mutex-design.txt
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: jason.low2@hp.com
Cc: aswin@hp.com
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Mason <clm@fb.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Airlie <airlied@linux.ie>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lubomir Rintel <lkundrak@v3.sk>
Cc: Masanari Iida <standby24x7@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: fengguang.wu@intel.com
Link: http://lkml.kernel.org/r/1406752916-3341-6-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Diffstat (limited to 'Documentation/locking')
-rw-r--r-- | Documentation/locking/lockdep-design.txt | 286 | ||||
-rw-r--r-- | Documentation/locking/lockstat.txt | 178 | ||||
-rw-r--r-- | Documentation/locking/mutex-design.txt | 157 | ||||
-rw-r--r-- | Documentation/locking/rt-mutex-design.txt | 781 | ||||
-rw-r--r-- | Documentation/locking/rt-mutex.txt | 79 | ||||
-rw-r--r-- | Documentation/locking/spinlocks.txt | 167 | ||||
-rw-r--r-- | Documentation/locking/ww-mutex-design.txt | 344 |
7 files changed, 1992 insertions, 0 deletions
diff --git a/Documentation/locking/lockdep-design.txt b/Documentation/locking/lockdep-design.txt | |||
new file mode 100644 | |||
index 000000000000..5dbc99c04f6e | |||
--- /dev/null | |||
+++ b/Documentation/locking/lockdep-design.txt | |||
@@ -0,0 +1,286 @@ | |||
1 | Runtime locking correctness validator | ||
2 | ===================================== | ||
3 | |||
4 | started by Ingo Molnar <mingo@redhat.com> | ||
5 | additions by Arjan van de Ven <arjan@linux.intel.com> | ||
6 | |||
7 | Lock-class | ||
8 | ---------- | ||
9 | |||
10 | The basic object the validator operates upon is a 'class' of locks. | ||
11 | |||
12 | A class of locks is a group of locks that are logically the same with | ||
13 | respect to locking rules, even if the locks may have multiple (possibly | ||
14 | tens of thousands of) instantiations. For example a lock in the inode | ||
15 | struct is one class, while each inode has its own instantiation of that | ||
16 | lock class. | ||
17 | |||
18 | The validator tracks the 'state' of lock-classes, and it tracks | ||
19 | dependencies between different lock-classes. The validator maintains a | ||
20 | rolling proof that the state and the dependencies are correct. | ||
21 | |||
22 | Unlike a lock instantiation, the lock-class itself never goes away: when | ||
23 | a lock-class is used for the first time after bootup it gets registered, | ||
24 | and all subsequent uses of that lock-class will be attached to this | ||
25 | lock-class. | ||
26 | |||
27 | State | ||
28 | ----- | ||
29 | |||
30 | The validator tracks lock-class usage history into 4n + 1 separate state bits: | ||
31 | |||
32 | - 'ever held in STATE context' | ||
33 | - 'ever held as readlock in STATE context' | ||
34 | - 'ever held with STATE enabled' | ||
35 | - 'ever held as readlock with STATE enabled' | ||
36 | |||
37 | Where STATE can be either one of (kernel/lockdep_states.h) | ||
38 | - hardirq | ||
39 | - softirq | ||
40 | - reclaim_fs | ||
41 | |||
42 | - 'ever used' [ == !unused ] | ||
43 | |||
44 | When locking rules are violated, these state bits are presented in the | ||
45 | locking error messages, inside curlies. A contrived example: | ||
46 | |||
47 | modprobe/2287 is trying to acquire lock: | ||
48 | (&sio_locks[i].lock){-.-...}, at: [<c02867fd>] mutex_lock+0x21/0x24 | ||
49 | |||
50 | but task is already holding lock: | ||
51 | (&sio_locks[i].lock){-.-...}, at: [<c02867fd>] mutex_lock+0x21/0x24 | ||
52 | |||
53 | |||
54 | The bit position indicates STATE, STATE-read, for each of the states listed | ||
55 | above, and the character displayed in each indicates: | ||
56 | |||
57 | '.' acquired while irqs disabled and not in irq context | ||
58 | '-' acquired in irq context | ||
59 | '+' acquired with irqs enabled | ||
60 | '?' acquired in irq context with irqs enabled. | ||
61 | |||
62 | Unused mutexes cannot be part of the cause of an error. | ||
63 | |||
64 | |||
65 | Single-lock state rules: | ||
66 | ------------------------ | ||
67 | |||
68 | A softirq-unsafe lock-class is automatically hardirq-unsafe as well. The | ||
69 | following states are exclusive, and only one of them is allowed to be | ||
70 | set for any lock-class: | ||
71 | |||
72 | <hardirq-safe> and <hardirq-unsafe> | ||
73 | <softirq-safe> and <softirq-unsafe> | ||
74 | |||
75 | The validator detects and reports lock usage that violates these | ||
76 | single-lock state rules. | ||
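The single-lock state rules above can be sketched in userspace C. This is only an illustration under stated assumptions: the enum names, the bit layout and mark_usage() are hypothetical and are not lockdep's actual data structures. "Safe" means the class was ever held in that irq context; "unsafe" means it was ever held with that irq type enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical usage bits for one lock-class (not lockdep's real layout). */
enum usage_bit {
	USED_IN_HARDIRQ = 1 << 0,	/* hardirq-safe */
	ENABLED_HARDIRQ = 1 << 1,	/* hardirq-unsafe */
	USED_IN_SOFTIRQ = 1 << 2,	/* softirq-safe */
	ENABLED_SOFTIRQ = 1 << 3,	/* softirq-unsafe */
};

#define SAFE_MASK	(USED_IN_HARDIRQ | USED_IN_SOFTIRQ)
#define UNSAFE_MASK	(ENABLED_HARDIRQ | ENABLED_SOFTIRQ)

/*
 * Record a new usage for a lock-class. Returns false, leaving *usage
 * untouched, when the new usage contradicts one already recorded.
 */
bool mark_usage(unsigned int *usage, enum usage_bit bit)
{
	unsigned int wanted = bit;
	unsigned int conflicts;

	assert(usage);

	/* A softirq-unsafe class is automatically hardirq-unsafe as well. */
	if (bit == ENABLED_SOFTIRQ)
		wanted |= ENABLED_HARDIRQ;

	/* Each safe bit excludes the unsafe bit next to it, and vice versa. */
	conflicts = ((wanted & SAFE_MASK) << 1) | ((wanted & UNSAFE_MASK) >> 1);
	if (*usage & conflicts)
		return false;

	*usage |= wanted;
	return true;
}
```

A false return from mark_usage() corresponds to the point where the validator would report a violation of the single-lock state rules.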
77 | |||
78 | Multi-lock dependency rules: | ||
79 | ---------------------------- | ||
80 | |||
81 | The same lock-class must not be acquired twice, because this could lead | ||
82 | to lock recursion deadlocks. | ||
83 | |||
84 | Furthermore, two locks may not be taken in different order: | ||
85 | |||
86 | <L1> -> <L2> | ||
87 | <L2> -> <L1> | ||
88 | |||
89 | because this could lead to lock inversion deadlocks. (The validator | ||
90 | finds such dependencies in arbitrary complexity, i.e. there can be any | ||
91 | other locking sequence between the acquire-lock operations, the | ||
92 | validator will still track all dependencies between locks.) | ||
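As a rough illustration of the ordering rule above, the following userspace C sketch keeps a small dependency graph between lock-classes and refuses an edge that would close a cycle. The names (dep, reachable(), add_dependency()) and the fixed-size matrix are assumptions made for this example; the real validator uses different data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CLASSES 16

/* dep[a][b] is true when class b was once acquired while class a was held. */
static bool dep[MAX_CLASSES][MAX_CLASSES];

/* Depth-first search: is there a chain of dependencies from src to dst? */
static bool reachable(int src, int dst, bool *visited)
{
	if (src == dst)
		return true;
	visited[src] = true;
	for (int next = 0; next < MAX_CLASSES; next++) {
		if (dep[src][next] && !visited[next] &&
		    reachable(next, dst, visited))
			return true;
	}
	return false;
}

/*
 * Record the dependency "held -> acquired". Returns false and records
 * nothing when the new edge would close a cycle, which covers both the
 * <L1> -> <L2> / <L2> -> <L1> inversion and taking the same class twice.
 */
bool add_dependency(int held, int acquired)
{
	bool visited[MAX_CLASSES];

	assert(held >= 0 && held < MAX_CLASSES);
	assert(acquired >= 0 && acquired < MAX_CLASSES);

	memset(visited, 0, sizeof(visited));
	if (reachable(acquired, held, visited))
		return false;

	dep[held][acquired] = true;
	return true;
}
```

Because the search follows chains of any length, an inversion is caught even when other classes sit between the two conflicting acquisitions.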
93 | |||
94 | Furthermore, the following usage based lock dependencies are not allowed | ||
95 | between any two lock-classes: | ||
96 | |||
97 | <hardirq-safe> -> <hardirq-unsafe> | ||
98 | <softirq-safe> -> <softirq-unsafe> | ||
99 | |||
100 | The first rule comes from the fact that a hardirq-safe lock could be | ||
101 | taken by a hardirq context, interrupting a hardirq-unsafe lock - and | ||
102 | thus could result in a lock inversion deadlock. Likewise, a softirq-safe | ||
103 | lock could be taken by a softirq context, interrupting a softirq-unsafe | ||
104 | lock. | ||
105 | |||
106 | The above rules are enforced for any locking sequence that occurs in the | ||
107 | kernel: when acquiring a new lock, the validator checks whether there is | ||
108 | any rule violation between the new lock and any of the held locks. | ||
109 | |||
110 | When a lock-class changes its state, the following aspects of the above | ||
111 | dependency rules are enforced: | ||
112 | |||
113 | - if a new hardirq-safe lock is discovered, we check whether it | ||
114 | took any hardirq-unsafe lock in the past. | ||
115 | |||
116 | - if a new softirq-safe lock is discovered, we check whether it took | ||
117 | any softirq-unsafe lock in the past. | ||
118 | |||
119 | - if a new hardirq-unsafe lock is discovered, we check whether any | ||
120 | hardirq-safe lock took it in the past. | ||
121 | |||
122 | - if a new softirq-unsafe lock is discovered, we check whether any | ||
123 | softirq-safe lock took it in the past. | ||
124 | |||
125 | (Again, we do these checks too on the basis that an interrupt context | ||
126 | could interrupt _any_ of the softirq-unsafe or hardirq-unsafe locks, which | ||
127 | could lead to a lock inversion deadlock - even if that lock scenario did | ||
128 | not trigger in practice yet.) | ||
129 | |||
130 | Exception: Nested data dependencies leading to nested locking | ||
131 | ------------------------------------------------------------- | ||
132 | |||
133 | There are a few cases where the Linux kernel acquires more than one | ||
134 | instance of the same lock-class. Such cases typically happen when there | ||
135 | is some sort of hierarchy within objects of the same type. In these | ||
136 | cases there is an inherent "natural" ordering between the two objects | ||
137 | (defined by the properties of the hierarchy), and the kernel grabs the | ||
138 | locks in this fixed order on each of the objects. | ||
139 | |||
140 | An example of such an object hierarchy that results in "nested locking" | ||
141 | is that of a "whole disk" block-dev object and a "partition" block-dev | ||
142 | object; the partition is "part of" the whole device and as long as one | ||
143 | always takes the whole disk lock as a higher lock than the partition | ||
144 | lock, the lock ordering is fully correct. The validator does not | ||
145 | automatically detect this natural ordering, as the locking rule behind | ||
146 | the ordering is not static. | ||
147 | |||
148 | In order to teach the validator about this correct usage model, new | ||
149 | versions of the various locking primitives were added that allow you to | ||
150 | specify a "nesting level". An example call, for the block device mutex, | ||
151 | looks like this: | ||
152 | |||
153 | enum bdev_bd_mutex_lock_class | ||
154 | { | ||
155 | BD_MUTEX_NORMAL, | ||
156 | BD_MUTEX_WHOLE, | ||
157 | BD_MUTEX_PARTITION | ||
158 | }; | ||
159 | |||
160 | mutex_lock_nested(&bdev->bd_contains->bd_mutex, BD_MUTEX_PARTITION); | ||
161 | |||
162 | In this case the locking is done on a bdev object that is known to be a | ||
163 | partition. | ||
164 | |||
165 | The validator treats a lock that is taken in such a nested fashion as a | ||
166 | separate (sub)class for the purposes of validation. | ||
167 | |||
168 | Note: When changing code to use the _nested() primitives, be careful and | ||
169 | check really thoroughly that the hierarchy is correctly mapped; otherwise | ||
170 | you can get false positives or false negatives. | ||
171 | |||
172 | Proof of 100% correctness: | ||
173 | -------------------------- | ||
174 | |||
175 | The validator achieves perfect, mathematical 'closure' (proof of locking | ||
176 | correctness) in the sense that for every simple, standalone single-task | ||
177 | locking sequence that occurred at least once during the lifetime of the | ||
178 | kernel, the validator proves it with a 100% certainty that no | ||
179 | combination and timing of these locking sequences can cause any class of | ||
180 | lock related deadlock. [*] | ||
181 | |||
182 | I.e. complex multi-CPU and multi-task locking scenarios do not have to | ||
183 | occur in practice to prove a deadlock: only the simple 'component' | ||
184 | locking chains have to occur at least once (anytime, in any | ||
185 | task/context) for the validator to be able to prove correctness. (For | ||
186 | example, complex deadlocks that would normally need more than 3 CPUs and | ||
187 | a very unlikely constellation of tasks, irq-contexts and timings to | ||
188 | occur, can be detected on a plain, lightly loaded single-CPU system as | ||
189 | well!) | ||
190 | |||
191 | This radically decreases the complexity of locking related QA of the | ||
192 | kernel: what has to be done during QA is to trigger as many "simple" | ||
193 | single-task locking dependencies in the kernel as possible, at least | ||
194 | once, to prove locking correctness - instead of having to trigger every | ||
195 | possible combination of locking interaction between CPUs, combined with | ||
196 | every possible hardirq and softirq nesting scenario (which is impossible | ||
197 | to do in practice). | ||
198 | |||
199 | [*] assuming that the validator itself is 100% correct, and no other | ||
200 | part of the system corrupts the state of the validator in any way. | ||
201 | We also assume that all NMI/SMM paths [which could interrupt | ||
202 | even hardirq-disabled codepaths] are correct and do not interfere | ||
203 | with the validator. We also assume that the 64-bit 'chain hash' | ||
204 | value is unique for every lock-chain in the system. Also, lock | ||
205 | recursion must not be higher than 20. | ||
206 | |||
207 | Performance: | ||
208 | ------------ | ||
209 | |||
210 | The above rules require _massive_ amounts of runtime checking. If we did | ||
211 | that for every lock taken and for every irqs-enable event, it would | ||
212 | render the system practically unusably slow. The complexity of checking | ||
213 | is O(N^2), so even with just a few hundred lock-classes we'd have to do | ||
214 | tens of thousands of checks for every event. | ||
215 | |||
216 | This problem is solved by checking any given 'locking scenario' (unique | ||
217 | sequence of locks taken after each other) only once. A simple stack of | ||
218 | held locks is maintained, and a lightweight 64-bit hash value is | ||
219 | calculated, which hash is unique for every lock chain. The hash value, | ||
220 | when the chain is validated for the first time, is then put into a hash | ||
221 | table, which hash-table can be checked in a lockfree manner. If the | ||
222 | locking chain occurs again later on, the hash table tells us that we | ||
223 | don't have to validate the chain again. | ||
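The "validate each chain only once" idea can be sketched as follows. This is a single-threaded userspace approximation with assumed names (chain_hash(), chain_needs_validation()); it uses FNV-1a-style mixing over class ids as a stand-in for the validator's real chain hash and does not attempt the lockfree lookup described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024		/* must be a power of two */

static uint64_t validated[TABLE_SIZE];
static bool slot_used[TABLE_SIZE];

/* Order-sensitive 64-bit hash over the stack of held lock-class ids. */
uint64_t chain_hash(const unsigned int *classes, size_t n)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV offset basis */

	assert(classes || n == 0);
	for (size_t i = 0; i < n; i++) {
		h ^= classes[i];
		h *= 0x100000001b3ULL;		/* FNV prime */
	}
	return h;
}

/*
 * Returns true the first time a chain is seen (full validation needed)
 * and false on later occurrences (the cached hash is enough).
 */
bool chain_needs_validation(const unsigned int *classes, size_t n)
{
	uint64_t h = chain_hash(classes, n);
	size_t slot = h & (TABLE_SIZE - 1);

	for (size_t probes = 0; probes < TABLE_SIZE; probes++) {
		if (!slot_used[slot]) {
			slot_used[slot] = true;
			validated[slot] = h;
			return true;
		}
		if (validated[slot] == h)
			return false;
		slot = (slot + 1) & (TABLE_SIZE - 1);	/* linear probing */
	}
	return true;	/* table full: fall back to validating every time */
}
```

Like the validator, the sketch relies on the 64-bit hash being unique per chain; two different chains with the same hash would wrongly skip validation.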
224 | |||
225 | Troubleshooting: | ||
226 | ---------------- | ||
227 | |||
228 | The validator tracks a maximum of MAX_LOCKDEP_KEYS number of lock classes. | ||
229 | Exceeding this number will trigger the following lockdep warning: | ||
230 | |||
231 | (DEBUG_LOCKS_WARN_ON(id >= MAX_LOCKDEP_KEYS)) | ||
232 | |||
233 | By default, MAX_LOCKDEP_KEYS is currently set to 8191, and typical | ||
234 | desktop systems have fewer than 1,000 lock classes, so this warning | ||
235 | normally results from lock-class leakage or failure to properly | ||
236 | initialize locks. These two problems are illustrated below: | ||
237 | |||
238 | 1. Repeated module loading and unloading while running the validator | ||
239 | will result in lock-class leakage. The issue here is that each | ||
240 | load of the module will create a new set of lock classes for | ||
241 | that module's locks, but module unloading does not remove old | ||
242 | classes (see the discussion of lock-class reuse below for why). | ||
243 | Therefore, if that module is loaded and unloaded repeatedly, | ||
244 | the number of lock classes will eventually reach the maximum. | ||
245 | |||
246 | 2. Using structures such as arrays that have large numbers of | ||
247 | locks that are not explicitly initialized. For example, | ||
248 | a hash table with 8192 buckets where each bucket has its own | ||
249 | spinlock_t will consume 8192 lock classes -unless- each spinlock | ||
250 | is explicitly initialized at runtime, for example, using the | ||
251 | run-time spin_lock_init() as opposed to compile-time initializers | ||
252 | such as __SPIN_LOCK_UNLOCKED(). Failure to properly initialize | ||
253 | the per-bucket spinlocks would guarantee lock-class overflow. | ||
254 | In contrast, a loop that called spin_lock_init() on each lock | ||
255 | would place all 8192 locks into a single lock class. | ||
256 | |||
257 | The moral of this story is that you should always explicitly | ||
258 | initialize your locks. | ||
259 | |||
260 | One might argue that the validator should be modified to allow | ||
261 | lock classes to be reused. However, if you are tempted to make this | ||
262 | argument, first review the code and think through the changes that would | ||
263 | be required, keeping in mind that the lock classes to be removed are | ||
264 | likely to be linked into the lock-dependency graph. This turns out to | ||
265 | be harder to do than to say. | ||
266 | |||
267 | Of course, if you do run out of lock classes, the next thing to do is | ||
268 | to find the offending lock classes. First, the following command gives | ||
269 | you the number of lock classes currently in use along with the maximum: | ||
270 | |||
271 | grep "lock-classes" /proc/lockdep_stats | ||
272 | |||
273 | This command produces the following output on a modest system: | ||
274 | |||
275 | lock-classes: 748 [max: 8191] | ||
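For scripted monitoring, a line of this form can be parsed with a few lines of C. The helper name parse_lock_classes() is hypothetical, and the sketch assumes the line keeps the "lock-classes: N [max: M]" shape shown above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Parse one line of /proc/lockdep_stats output. Returns true and fills
 * in_use and max only when the line is the "lock-classes" line.
 */
bool parse_lock_classes(const char *line, unsigned int *in_use,
			unsigned int *max)
{
	assert(line && in_use && max);

	/* Whitespace in the format matches any run of blanks in the input. */
	return sscanf(line, " lock-classes: %u [max: %u]", in_use, max) == 2;
}
```

Sampling the value periodically and comparing successive readings is one way to notice a count that keeps growing.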
276 | |||
277 | If the number allocated (748 above) increases continually over time, | ||
278 | then there is likely a leak. The following command can be used to | ||
279 | identify the leaking lock classes: | ||
280 | |||
281 | grep "BD" /proc/lockdep | ||
282 | |||
283 | Run the command and save the output, then compare against the output from | ||
284 | a later run of this command to identify the leakers. This same output | ||
285 | can also help you find situations where runtime lock initialization has | ||
286 | been omitted. | ||
diff --git a/Documentation/locking/lockstat.txt b/Documentation/locking/lockstat.txt | |||
new file mode 100644 | |||
index 000000000000..7428773a1e69 | |||
--- /dev/null | |||
+++ b/Documentation/locking/lockstat.txt | |||
@@ -0,0 +1,178 @@ | |||
1 | |||
2 | LOCK STATISTICS | ||
3 | |||
4 | - WHAT | ||
5 | |||
6 | As the name suggests, it provides statistics on locks. | ||
7 | |||
8 | - WHY | ||
9 | |||
10 | Because things like lock contention can severely impact performance. | ||
11 | |||
12 | - HOW | ||
13 | |||
14 | Lockdep already has hooks in the lock functions and maps lock instances to | ||
15 | lock classes. We build on that (see Documentation/locking/lockdep-design.txt). | ||
16 | The graph below shows the relation between the lock functions and the various | ||
17 | hooks therein. | ||
18 | |||
19 | __acquire | ||
20 | | | ||
21 | lock _____ | ||
22 | | \ | ||
23 | | __contended | ||
24 | | | | ||
25 | | <wait> | ||
26 | | _______/ | ||
27 | |/ | ||
28 | | | ||
29 | __acquired | ||
30 | | | ||
31 | . | ||
32 | <hold> | ||
33 | . | ||
34 | | | ||
35 | __release | ||
36 | | | ||
37 | unlock | ||
38 | |||
39 | lock, unlock - the regular lock functions | ||
40 | __* - the hooks | ||
41 | <> - states | ||
42 | |||
43 | With these hooks we provide the following statistics: | ||
44 | |||
45 | con-bounces - number of lock contentions that involved x-cpu data | ||
46 | contentions - number of lock acquisitions that had to wait | ||
47 | wait time min - shortest (non-0) time we ever had to wait for a lock | ||
48 | max - longest time we ever had to wait for a lock | ||
49 | total - total time we spent waiting on this lock | ||
50 | avg - average time spent waiting on this lock | ||
51 | acq-bounces - number of lock acquisitions that involved x-cpu data | ||
52 | acquisitions - number of times we took the lock | ||
53 | hold time min - shortest (non-0) time we ever held the lock | ||
54 | max - longest time we ever held the lock | ||
55 | total - total time this lock was held | ||
56 | avg - average time this lock was held | ||
57 | |||
58 | These numbers are gathered per lock class, per read/write state (when | ||
59 | applicable). | ||
60 | |||
61 | It also tracks 4 contention points per class. A contention point is a call site | ||
62 | that had to wait on lock acquisition. | ||
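The min/max/total/avg figures listed above could be accumulated along the following lines. This is a userspace sketch only: struct lock_time_sketch and its helpers are hypothetical names, samples are taken to be integer nanoseconds, and the in-kernel bookkeeping may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-class accumulator for either wait time or hold time. */
struct lock_time_sketch {
	uint64_t min;	/* shortest non-zero sample, 0 if none yet */
	uint64_t max;	/* longest sample */
	uint64_t total;	/* sum of all samples */
	uint64_t nr;	/* number of samples */
};

/* Fold one measured duration into the accumulator. */
void lock_time_add(struct lock_time_sketch *lt, uint64_t ns)
{
	assert(lt);

	if (ns > lt->max)
		lt->max = ns;
	if (ns != 0 && (lt->min == 0 || ns < lt->min))
		lt->min = ns;
	lt->total += ns;
	lt->nr++;
}

/* Average is total divided by the number of samples (0 when empty). */
uint64_t lock_time_avg(const struct lock_time_sketch *lt)
{
	assert(lt);
	return lt->nr ? lt->total / lt->nr : 0;
}
```

The same relation can be checked against the sample output in the USAGE section: for &mm->mmap_sem-W, waittime-total 16371.53 divided by 84 contentions gives roughly 194.90, the reported waittime-avg.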
63 | |||
64 | - CONFIGURATION | ||
65 | |||
66 | Lock statistics are enabled via CONFIG_LOCK_STAT. | ||
67 | |||
68 | - USAGE | ||
69 | |||
70 | Enable collection of statistics: | ||
71 | |||
72 | # echo 1 >/proc/sys/kernel/lock_stat | ||
73 | |||
74 | Disable collection of statistics: | ||
75 | |||
76 | # echo 0 >/proc/sys/kernel/lock_stat | ||
77 | |||
78 | Look at the current lock statistics: | ||
79 | |||
80 | ( line numbers not part of actual output, done for clarity in the explanation | ||
81 | below ) | ||
82 | |||
83 | # less /proc/lock_stat | ||
84 | |||
85 | 01 lock_stat version 0.4 | ||
86 | 02----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
87 | 03 class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg | ||
88 | 04----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
89 | 05 | ||
90 | 06 &mm->mmap_sem-W: 46 84 0.26 939.10 16371.53 194.90 47291 2922365 0.16 2220301.69 17464026916.32 5975.99 | ||
91 | 07 &mm->mmap_sem-R: 37 100 1.31 299502.61 325629.52 3256.30 212344 34316685 0.10 7744.91 95016910.20 2.77 | ||
92 | 08 --------------- | ||
93 | 09 &mm->mmap_sem 1 [<ffffffff811502a7>] khugepaged_scan_mm_slot+0x57/0x280 | ||
94 | 10 &mm->mmap_sem 96 [<ffffffff815351c4>] __do_page_fault+0x1d4/0x510 | ||
95 | 11 &mm->mmap_sem 34 [<ffffffff81113d77>] vm_mmap_pgoff+0x87/0xd0 | ||
96 | 12 &mm->mmap_sem 17 [<ffffffff81127e71>] vm_munmap+0x41/0x80 | ||
97 | 13 --------------- | ||
98 | 14 &mm->mmap_sem 1 [<ffffffff81046fda>] dup_mmap+0x2a/0x3f0 | ||
99 | 15 &mm->mmap_sem 60 [<ffffffff81129e29>] SyS_mprotect+0xe9/0x250 | ||
100 | 16 &mm->mmap_sem 41 [<ffffffff815351c4>] __do_page_fault+0x1d4/0x510 | ||
101 | 17 &mm->mmap_sem 68 [<ffffffff81113d77>] vm_mmap_pgoff+0x87/0xd0 | ||
102 | 18 | ||
103 | 19............................................................................................................................................................................................................................. | ||
104 | 20 | ||
105 | 21 unix_table_lock: 110 112 0.21 49.24 163.91 1.46 21094 66312 0.12 624.42 31589.81 0.48 | ||
106 | 22 --------------- | ||
107 | 23 unix_table_lock 45 [<ffffffff8150ad8e>] unix_create1+0x16e/0x1b0 | ||
108 | 24 unix_table_lock 47 [<ffffffff8150b111>] unix_release_sock+0x31/0x250 | ||
109 | 25 unix_table_lock 15 [<ffffffff8150ca37>] unix_find_other+0x117/0x230 | ||
110 | 26 unix_table_lock 5 [<ffffffff8150a09f>] unix_autobind+0x11f/0x1b0 | ||
111 | 27 --------------- | ||
112 | 28 unix_table_lock 39 [<ffffffff8150b111>] unix_release_sock+0x31/0x250 | ||
113 | 29 unix_table_lock 49 [<ffffffff8150ad8e>] unix_create1+0x16e/0x1b0 | ||
114 | 30 unix_table_lock 20 [<ffffffff8150ca37>] unix_find_other+0x117/0x230 | ||
115 | 31 unix_table_lock 4 [<ffffffff8150a09f>] unix_autobind+0x11f/0x1b0 | ||
116 | |||
117 | |||
118 | This excerpt shows the first two lock class statistics. Line 01 shows the | ||
119 | output version - each time the format changes this will be updated. Lines 02-04 | ||
120 | show the header with column descriptions. Lines 05-18 and 20-31 show the actual | ||
121 | statistics. These statistics come in two parts: the actual stats, separated by a | ||
122 | short separator (lines 08, 13) from the contention points. | ||
123 | |||
124 | The first lock (05-18) is a read/write lock, and shows two lines above the | ||
125 | short separator. The contention points don't match the column descriptors; | ||
126 | they have two columns: contentions and [<IP>] symbol. The second set of | ||
127 | contention points are the points we're contending with. | ||
128 | |||
129 | The integer part of the time values is in us. | ||
130 | |||
131 | When dealing with nested locks, subclasses may appear: | ||
132 | |||
133 | 32........................................................................................................................................................................................................................... | ||
134 | 33 | ||
135 | 34 &rq->lock: 13128 13128 0.43 190.53 103881.26 7.91 97454 3453404 0.00 401.11 13224683.11 3.82 | ||
136 | 35 --------- | ||
137 | 36 &rq->lock 645 [<ffffffff8103bfc4>] task_rq_lock+0x43/0x75 | ||
138 | 37 &rq->lock 297 [<ffffffff8104ba65>] try_to_wake_up+0x127/0x25a | ||
139 | 38 &rq->lock 360 [<ffffffff8103c4c5>] select_task_rq_fair+0x1f0/0x74a | ||
140 | 39 &rq->lock 428 [<ffffffff81045f98>] scheduler_tick+0x46/0x1fb | ||
141 | 40 --------- | ||
142 | 41 &rq->lock 77 [<ffffffff8103bfc4>] task_rq_lock+0x43/0x75 | ||
143 | 42 &rq->lock 174 [<ffffffff8104ba65>] try_to_wake_up+0x127/0x25a | ||
144 | 43 &rq->lock 4715 [<ffffffff8103ed4b>] double_rq_lock+0x42/0x54 | ||
145 | 44 &rq->lock 893 [<ffffffff81340524>] schedule+0x157/0x7b8 | ||
146 | 45 | ||
147 | 46........................................................................................................................................................................................................................... | ||
148 | 47 | ||
149 | 48 &rq->lock/1: 1526 11488 0.33 388.73 136294.31 11.86 21461 38404 0.00 37.93 109388.53 2.84 | ||
150 | 49 ----------- | ||
151 | 50 &rq->lock/1 11526 [<ffffffff8103ed58>] double_rq_lock+0x4f/0x54 | ||
152 | 51 ----------- | ||
153 | 52 &rq->lock/1 5645 [<ffffffff8103ed4b>] double_rq_lock+0x42/0x54 | ||
154 | 53 &rq->lock/1 1224 [<ffffffff81340524>] schedule+0x157/0x7b8 | ||
155 | 54 &rq->lock/1 4336 [<ffffffff8103ed58>] double_rq_lock+0x4f/0x54 | ||
156 | 55 &rq->lock/1 181 [<ffffffff8104ba65>] try_to_wake_up+0x127/0x25a | ||
157 | |||
158 | Line 48 shows statistics for the second subclass (/1) of &rq->lock class | ||
159 | (subclass starts from 0), since in this case, as line 50 suggests, | ||
160 | double_rq_lock actually acquires a nested lock of two spinlocks. | ||
161 | |||
162 | View the top contending locks: | ||
163 | |||
164 | # grep : /proc/lock_stat | head | ||
165 | clockevents_lock: 2926159 2947636 0.15 46882.81 1784540466.34 605.41 3381345 3879161 0.00 2260.97 53178395.68 13.71 | ||
166 | tick_broadcast_lock: 346460 346717 0.18 2257.43 39364622.71 113.54 3642919 4242696 0.00 2263.79 49173646.60 11.59 | ||
167 | &mapping->i_mmap_mutex: 203896 203899 3.36 645530.05 31767507988.39 155800.21 3361776 8893984 0.17 2254.15 14110121.02 1.59 | ||
168 | &rq->lock: 135014 136909 0.18 606.09 842160.68 6.15 1540728 10436146 0.00 728.72 17606683.41 1.69 | ||
169 | &(&zone->lru_lock)->rlock: 93000 94934 0.16 59.18 188253.78 1.98 1199912 3809894 0.15 391.40 3559518.81 0.93 | ||
170 | tasklist_lock-W: 40667 41130 0.23 1189.42 428980.51 10.43 270278 510106 0.16 653.51 3939674.91 7.72 | ||
171 | tasklist_lock-R: 21298 21305 0.20 1310.05 215511.12 10.12 186204 241258 0.14 1162.33 1179779.23 4.89 | ||
172 | rcu_node_1: 47656 49022 0.16 635.41 193616.41 3.95 844888 1865423 0.00 764.26 1656226.96 0.89 | ||
173 | &(&dentry->d_lockref.lock)->rlock: 39791 40179 0.15 1302.08 88851.96 2.21 2790851 12527025 0.10 1910.75 3379714.27 0.27 | ||
174 | rcu_node_0: 29203 30064 0.16 786.55 1555573.00 51.74 88963 244254 0.00 398.87 428872.51 1.76 | ||
175 | |||
176 | Clear the statistics: | ||
177 | |||
178 | # echo 0 > /proc/lock_stat | ||
diff --git a/Documentation/locking/mutex-design.txt b/Documentation/locking/mutex-design.txt | |||
new file mode 100644 | |||
index 000000000000..ee231ed09ec6 | |||
--- /dev/null | |||
+++ b/Documentation/locking/mutex-design.txt | |||
@@ -0,0 +1,157 @@ | |||
1 | Generic Mutex Subsystem | ||
2 | |||
3 | started by Ingo Molnar <mingo@redhat.com> | ||
4 | updated by Davidlohr Bueso <davidlohr@hp.com> | ||
5 | |||
6 | What are mutexes? | ||
7 | ----------------- | ||
8 | |||
9 | In the Linux kernel, mutexes refer to a particular locking primitive | ||
10 | that enforces serialization on shared memory systems, and not only to | ||
11 | the generic term referring to 'mutual exclusion' found in academia | ||
12 | or similar theoretical text books. Mutexes are sleeping locks which | ||
13 | behave similarly to binary semaphores, and were introduced in 2006[1] | ||
14 | as an alternative to these. This new data structure provided a number | ||
15 | of advantages, including simpler interfaces, and at that time smaller | ||
16 | code (see Disadvantages). | ||
17 | |||
18 | [1] http://lwn.net/Articles/164802/ | ||
19 | |||
20 | Implementation | ||
21 | -------------- | ||
22 | |||
23 | Mutexes are represented by 'struct mutex', defined in include/linux/mutex.h | ||
24 | and implemented in kernel/locking/mutex.c. These locks use a three | ||
25 | state atomic counter (->count) to represent the different possible | ||
26 | transitions that can occur during the lifetime of a lock: | ||
27 | |||
28 | 1: unlocked | ||
29 | 0: locked, no waiters | ||
30 | negative: locked, with potential waiters | ||
31 | |||
32 | In its most basic form it also includes a wait-queue and a spinlock | ||
33 | that serializes access to it. CONFIG_SMP systems can also include | ||
34 | a pointer to the lock task owner (->owner) as well as a spinner MCS | ||
35 | lock (->osq), both described below in (ii). | ||
36 | |||
37 | When acquiring a mutex, there are three possible paths that can be | ||
38 | taken, depending on the state of the lock: | ||
39 | |||
40 | (i) fastpath: tries to atomically acquire the lock by decrementing the | ||
41 | counter. If it was already taken by another task it goes to the next | ||
42 | possible path. This logic is architecture specific. On x86-64, the | ||
43 | locking fastpath is 2 instructions: | ||
44 | |||
45 | 0000000000000e10 <mutex_lock>: | ||
46 | e21: f0 ff 0b lock decl (%rbx) | ||
47 | e24: 79 08 jns e2e <mutex_lock+0x1e> | ||
48 | |||
49 | the unlocking fastpath is equally tight: | ||
50 | |||
51 | 0000000000000bc0 <mutex_unlock>: | ||
52 | bc8: f0 ff 07 lock incl (%rdi) | ||
53 | bcb: 7f 0a jg bd7 <mutex_unlock+0x17> | ||
54 | |||
55 | |||
56 | (ii) midpath: aka optimistic spinning, tries to spin for acquisition | ||
57 | while the lock owner is running and there are no other tasks ready | ||
58 | to run that have higher priority (need_resched). The rationale is | ||
59 | that if the lock owner is running, it is likely to release the lock | ||
60 | soon. The mutex spinners are queued up using an MCS lock so that only | ||
61 | one spinner can compete for the mutex. | ||
62 | |||
63 | The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spinlock | ||
64 | with the desirable properties of being fair and with each cpu trying | ||
65 | to acquire the lock spinning on a local variable. It avoids expensive | ||
66 | cacheline bouncing that common test-and-set spinlock implementations | ||
67 | incur. An MCS-like lock is specially tailored for optimistic spinning | ||
68 | for sleeping lock implementations. An important feature of the customized | ||
69 | MCS lock is that it has the extra property that spinners are able to exit | ||
70 | the MCS spinlock queue when they need to reschedule. This further helps | ||
71 | avoid situations where MCS spinners that need to reschedule would continue | ||
72 | waiting to spin on mutex owner, only to go directly to slowpath upon | ||
73 | obtaining the MCS lock. | ||
74 | |||
75 | |||
76 | (iii) slowpath: last resort, if the lock is still unable to be acquired, | ||
77 | the task is added to the wait-queue and sleeps until woken up by the | ||
78 | unlock path. Under normal circumstances it blocks as TASK_UNINTERRUPTIBLE. | ||
79 | |||
80 | While formally kernel mutexes are sleepable locks, it is path (ii) that | ||
81 | makes them more practically a hybrid type. By simply not interrupting a | ||
82 | task and busy-waiting for a few cycles instead of immediately sleeping, | ||
83 | the performance of this lock has been seen to significantly improve a | ||
84 | number of workloads. Note that this technique is also used for rw-semaphores. | ||
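The counter arithmetic behind the fastpath in (i) can be mimicked in userspace with C11 atomics. This sketch is not the kernel implementation: struct sketch_mutex and the function names are hypothetical, and there is no midpath or slowpath here; a false return only marks the point where the real code would leave the fastpath.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* count: 1 unlocked, 0 locked with no waiters, negative: possible waiters */
struct sketch_mutex {
	atomic_int count;
};

void sketch_mutex_init(struct sketch_mutex *m)
{
	assert(m);
	atomic_init(&m->count, 1);
}

/*
 * Mirrors "lock decl; jns": decrement, and succeed if the new value is
 * not negative. Returns false when another path would have to take over.
 */
bool sketch_lock_fastpath(struct sketch_mutex *m)
{
	assert(m);
	return atomic_fetch_sub(&m->count, 1) - 1 >= 0;
}

/*
 * Mirrors "lock incl; jg": increment, and finish if the new value is
 * positive. Returns false when waiters may need to be woken.
 */
bool sketch_unlock_fastpath(struct sketch_mutex *m)
{
	assert(m);
	return atomic_fetch_add(&m->count, 1) + 1 > 0;
}
```

In the uncontended case the counter simply moves 1 -> 0 -> 1, and both calls return true without touching a wait-queue.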
85 | |||
86 | Semantics | ||
87 | --------- | ||
88 | |||
89 | The mutex subsystem checks and enforces the following rules: | ||
90 | |||
91 | - Only one task can hold the mutex at a time. | ||
92 | - Only the owner can unlock the mutex. | ||
93 | - Multiple unlocks are not permitted. | ||
94 | - Recursive locking/unlocking is not permitted. | ||
95 | - A mutex must only be initialized via the API (see below). | ||
96 | - A task may not exit with a mutex held. | ||
97 | - Memory areas where held locks reside must not be freed. | ||
98 | - Held mutexes must not be reinitialized. | ||
99 | - Mutexes may not be used in hardware or software interrupt | ||
100 | contexts such as tasklets and timers. | ||
101 | |||
102 | These semantics are fully enforced when CONFIG_DEBUG_MUTEXES is enabled. | ||
103 | In addition, the mutex debugging code also implements a number of other | ||
104 | features that make lock debugging easier and faster: | ||
105 | |||
106 | - Uses symbolic names of mutexes, whenever they are printed | ||
107 | in debug output. | ||
108 | - Point-of-acquire tracking, symbolic lookup of function names, | ||
109 | list of all locks held in the system, printout of them. | ||
110 | - Owner tracking. | ||
111 | - Detects self-recursing locks and prints out all relevant info. | ||
112 | - Detects multi-task circular deadlocks and prints out all affected | ||
113 | locks and tasks (and only those tasks). | ||
114 | |||
115 | |||
116 | Interfaces | ||
117 | ---------- | ||
118 | Statically define the mutex: | ||
119 | DEFINE_MUTEX(name); | ||
120 | |||
121 | Dynamically initialize the mutex: | ||
122 | mutex_init(mutex); | ||
123 | |||
124 | Acquire the mutex, uninterruptible: | ||
125 | void mutex_lock(struct mutex *lock); | ||
126 | void mutex_lock_nested(struct mutex *lock, unsigned int subclass); | ||
127 | int mutex_trylock(struct mutex *lock); | ||
128 | |||
129 | Acquire the mutex, interruptible: | ||
130 | int mutex_lock_interruptible_nested(struct mutex *lock, | ||
131 | unsigned int subclass); | ||
132 | int mutex_lock_interruptible(struct mutex *lock); | ||
133 | |||
134 | Acquire the mutex, uninterruptible, if dec to 0: | ||
135 | int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock); | ||
136 | |||
137 | Unlock the mutex: | ||
138 | void mutex_unlock(struct mutex *lock); | ||
139 | |||
140 | Test if the mutex is taken: | ||
141 | int mutex_is_locked(struct mutex *lock); | ||
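
The "if dec to 0" interface is the least obvious one in the list, so here is a
rough userspace analogue built on pthreads and C11 atomics. It is a sketch of
the idea only, under the assumption that the counter is decremented without
the lock unless it might reach zero; toy_add_unless and
toy_dec_and_mutex_lock are hypothetical names, not kernel functions.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Add 'a' to *v unless *v currently equals 'u'; returns true if added. */
static bool toy_add_unless(atomic_int *v, int a, int u)
{
	int c = atomic_load(v);

	while (c != u) {
		/* On failure, c is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &c, c + a))
			return true;
	}
	return false;
}

/*
 * Decrement *cnt. Returns 1 with the mutex held if the counter reached
 * zero, otherwise returns 0 with the mutex not held.
 */
static int toy_dec_and_mutex_lock(atomic_int *cnt, pthread_mutex_t *lock)
{
	/* Fast case: the decrement cannot possibly hit zero. */
	if (toy_add_unless(cnt, -1, 1))
		return 0;

	/* We might hit zero, so take the lock before decrementing. */
	pthread_mutex_lock(lock);
	if (atomic_fetch_sub(cnt, 1) != 1) {
		/* Someone raised the count meanwhile; zero was not reached. */
		pthread_mutex_unlock(lock);
		return 0;
	}
	return 1;
}
```

A caller that gets 1 back typically tears down the object the counter
protects and then unlocks the mutex.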
142 | |||
143 | Disadvantages | ||
144 | ------------- | ||
145 | |||
146 | Unlike its original design and purpose, 'struct mutex' is larger than | ||
147 | most locks in the kernel. E.g: on x86-64 it is 40 bytes, almost twice | ||
148 | as large as 'struct semaphore' (24 bytes) and 8 bytes shy of the | ||
149 | 'struct rw_semaphore' variant. Larger structure sizes mean more CPU | ||
150 | cache and memory footprint. | ||
151 | |||
152 | When to use mutexes | ||
153 | ------------------- | ||
154 | |||
155 | Unless the strict semantics of mutexes are unsuitable and/or the critical | ||
156 | region prevents the lock from being shared, always prefer them to any other | ||
157 | locking primitive. | ||
diff --git a/Documentation/locking/rt-mutex-design.txt b/Documentation/locking/rt-mutex-design.txt new file mode 100644 index 000000000000..8666070d3189 --- /dev/null +++ b/Documentation/locking/rt-mutex-design.txt | |||
@@ -0,0 +1,781 @@ | |||
1 | # | ||
2 | # Copyright (c) 2006 Steven Rostedt | ||
3 | # Licensed under the GNU Free Documentation License, Version 1.2 | ||
4 | # | ||
5 | |||
6 | RT-mutex implementation design | ||
7 | ------------------------------ | ||
8 | |||
9 | This document tries to describe the design of the rtmutex.c implementation. | ||
10 | It doesn't describe the reasons why rtmutex.c exists. For that please see | ||
11 | Documentation/rt-mutex.txt. This document does explain problems that | ||
12 | happen without this code, but only as background needed to understand | ||
13 | what the code is actually doing. | ||
14 | |||
15 | The goal of this document is to help others understand the priority | ||
16 | inheritance (PI) algorithm that is used, as well as reasons for the | ||
17 | decisions that were made to implement PI in the manner that was done. | ||
18 | |||
19 | |||
20 | Unbounded Priority Inversion | ||
21 | ---------------------------- | ||
22 | |||
23 | Priority inversion is when a lower priority process executes while a higher | ||
24 | priority process wants to run. This happens for several reasons, and | ||
25 | most of the time it can't be helped. Anytime a high priority process wants | ||
26 | to use a resource that a lower priority process has (a mutex for example), | ||
27 | the high priority process must wait until the lower priority process is done | ||
28 | with the resource. This is a priority inversion. What we want to prevent | ||
29 | is something called unbounded priority inversion. That is when the high | ||
30 | priority process is prevented from running by a lower priority process for | ||
31 | an undetermined amount of time. | ||
32 | |||
33 | The classic example of unbounded priority inversion is where you have three | ||
34 | processes, let's call them processes A, B, and C, where A is the highest | ||
35 | priority process, C is the lowest, and B is in between. A tries to grab a lock | ||
36 | that C owns and must wait and lets C run to release the lock. But in the | ||
37 | meantime, B executes, and since B is of a higher priority than C, it preempts C, | ||
38 | but by doing so, it is in fact preempting A which is a higher priority process. | ||
39 | Now there's no way of knowing how long A will be sleeping waiting for C | ||
40 | to release the lock, because for all we know, B is a CPU hog and will | ||
41 | never give C a chance to release the lock. This is called unbounded priority | ||
42 | inversion. | ||
43 | |||
44 | Here's a little ASCII art to show the problem. | ||
45 | |||
46 |    grab lock L1 (owned by C) | ||
47 |      | | ||
48 | A ---+ | ||
49 |         C preempted by B | ||
50 |           | | ||
51 | C    +----+ | ||
52 | |||
53 | B         +--------> | ||
54 |                 B now keeps A from running. | ||
55 | |||
56 | |||
57 | Priority Inheritance (PI) | ||
58 | ------------------------- | ||
59 | |||
60 | There are several ways to solve this issue, but other ways are out of scope | ||
61 | for this document. Here we only discuss PI. | ||
62 | |||
63 | PI is where a process inherits the priority of another process if the other | ||
64 | process blocks on a lock owned by the current process. To make this easier | ||
65 | to understand, let's use the previous example, with processes A, B, and C again. | ||
66 | |||
67 | This time, when A blocks on the lock owned by C, C would inherit the priority | ||
68 | of A. So now if B becomes runnable, it would not preempt C, since C now has | ||
69 | the high priority of A. As soon as C releases the lock, it loses its | ||
70 | inherited priority, and A then can continue with the resource that C had. | ||
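
The A, B, C example can be played through with a toy model. This is only a
sketch under stated assumptions: pi_task and the pi_* helpers are hypothetical
names, there is a single lock, and a lower number means a higher priority
(the same convention the kernel's prio field uses).

```c
#include <stdbool.h>
#include <stddef.h>

struct pi_task {
	const char *name;
	int normal_prio;	/* lower number = higher priority */
	int prio;		/* effective (possibly inherited) priority */
	bool runnable;
};

/* 'waiter' blocks on a lock held by 'owner': the owner inherits if needed. */
static void pi_block_on(struct pi_task *waiter, struct pi_task *owner)
{
	waiter->runnable = false;
	if (waiter->prio < owner->prio)
		owner->prio = waiter->prio;
}

/* 'owner' releases the lock: it drops the boost and the waiter can run. */
static void pi_release_to(struct pi_task *owner, struct pi_task *waiter)
{
	owner->prio = owner->normal_prio;
	waiter->runnable = true;
}

/* A trivial scheduler: pick the runnable task with the best priority. */
static struct pi_task *pi_pick_next(struct pi_task **tasks, size_t n)
{
	struct pi_task *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!tasks[i]->runnable)
			continue;
		if (best == NULL || tasks[i]->prio < best->prio)
			best = tasks[i];
	}
	return best;
}
```

With A blocked on C's lock, pi_pick_next() chooses C over B because C runs at
A's priority; once C releases, it drops back and A is picked.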
71 | |||
72 | Terminology | ||
73 | ----------- | ||
74 | |||
75 | Here I explain some terminology that is used in this document to help describe | ||
76 | the design that is used to implement PI. | ||
77 | |||
78 | PI chain - The PI chain is an ordered series of locks and processes that cause | ||
79 | processes to inherit priorities from a previous process that is | ||
80 | blocked on one of its locks. This is described in more detail | ||
81 | later in this document. | ||
82 | |||
83 | mutex - In this document, to differentiate between locks that implement | ||
84 | PI and spin locks that are used in the PI code, from now on | ||
85 | the PI locks will be called mutexes. | ||
86 | |||
87 | lock - In this document from now on, I will use the term lock when | ||
88 | referring to spin locks that are used to protect parts of the PI | ||
89 | algorithm. These locks disable preemption for UP (when | ||
90 | CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from | ||
91 | entering critical sections simultaneously. | ||
92 | |||
93 | spin lock - Same as lock above. | ||
94 | |||
95 | waiter - A waiter is a struct that is stored on the stack of a blocked | ||
96 | process. Since the scope of the waiter is within the code for | ||
97 | a process being blocked on the mutex, it is fine to allocate | ||
98 | the waiter on the process's stack (local variable). This | ||
99 | structure holds a pointer to the task, as well as the mutex that | ||
100 | the task is blocked on. It also has the plist node structures to | ||
101 | place the task in the waiter_list of a mutex as well as the | ||
102 | pi_list of a mutex owner task (described below). | ||
103 | |||
104 | waiter is sometimes used in reference to the task that is waiting | ||
105 | on a mutex. This is the same as waiter->task. | ||
106 | |||
107 | waiters - A list of processes that are blocked on a mutex. | ||
108 | |||
109 | top waiter - The highest priority process waiting on a specific mutex. | ||
110 | |||
111 | top pi waiter - The highest priority process waiting on one of the mutexes | ||
112 | that a specific process owns. | ||
113 | |||
114 | Note: task and process are used interchangeably in this document, mostly to | ||
115 | differentiate between two processes that are being described together. | ||
116 | |||
117 | |||
118 | PI chain | ||
119 | -------- | ||
120 | |||
121 | The PI chain is a list of processes and mutexes that may cause priority | ||
122 | inheritance to take place. Multiple chains may converge, but a chain | ||
123 | would never diverge, since a process can't be blocked on more than one | ||
124 | mutex at a time. | ||
125 | |||
126 | Example: | ||
127 | |||
128 | Process: A, B, C, D, E | ||
129 | Mutexes: L1, L2, L3, L4 | ||
130 | |||
131 | A owns: L1 | ||
132 | B blocked on L1 | ||
133 | B owns L2 | ||
134 | C blocked on L2 | ||
135 | C owns L3 | ||
136 | D blocked on L3 | ||
137 | D owns L4 | ||
138 | E blocked on L4 | ||
139 | |||
140 | The chain would be: | ||
141 | |||
142 | E->L4->D->L3->C->L2->B->L1->A | ||
143 | |||
144 | To show where two chains merge, we could add another process F and | ||
145 | another mutex L5 where B owns L5 and F is blocked on mutex L5. | ||
146 | |||
147 | The chain for F would be: | ||
148 | |||
149 | F->L5->B->L1->A | ||
150 | |||
151 | Since a process may own more than one mutex, but never be blocked on more than | ||
152 | one, the chains merge. | ||
153 | |||
154 | Here we show both chains: | ||
155 | |||
156 | E->L4->D->L3->C->L2-+ | ||
157 |                     | | ||
158 |                     +->B->L1->A | ||
159 |                     | | ||
160 |               F->L5-+ | ||
161 | |||
162 | For PI to work, the processes at the right end of these chains (or we may | ||
163 | also call it the Top of the chain) must be equal to or higher in priority | ||
164 | than the processes to the left or below in the chain. | ||
165 | |||
166 | Also since a mutex may have more than one process blocked on it, we can | ||
167 | have multiple chains merge at mutexes. If we add another process G that is | ||
168 | blocked on mutex L2: | ||
169 | |||
170 | G->L2->B->L1->A | ||
171 | |||
172 | And once again, to show how this can grow I will show the merging chains | ||
173 | again. | ||
174 | |||
175 | E->L4->D->L3->C-+ | ||
176 |                 +->L2-+ | ||
177 |                 |     | | ||
178 |               G-+     +->B->L1->A | ||
179 |                       | | ||
180 |                 F->L5-+ | ||
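
Because a task blocks on at most one mutex and a mutex has at most one owner,
a chain can be followed with two pointers. The sketch below models that; the
chain_task and chain_mutex structures and both helpers are hypothetical, a
lower number means a higher priority, and there is no deadlock (cycle)
detection, so it must only be used on acyclic chains.

```c
#include <stddef.h>

struct chain_mutex;

struct chain_task {
	const char *name;
	int prio;				/* lower number = higher priority */
	struct chain_mutex *blocked_on;		/* at most one mutex, or NULL */
};

struct chain_mutex {
	const char *name;
	struct chain_task *owner;		/* NULL when not owned */
};

/* Follow task -> mutex -> owner until a task that is not blocked. */
static struct chain_task *chain_top(struct chain_task *task)
{
	while (task->blocked_on && task->blocked_on->owner)
		task = task->blocked_on->owner;
	return task;
}

/*
 * Push the priority of 'task' towards the top of its chain, so that every
 * owner on the way is at least as high in priority as the tasks behind it.
 */
static void chain_boost(struct chain_task *task)
{
	int prio = task->prio;

	while (task->blocked_on && task->blocked_on->owner) {
		task = task->blocked_on->owner;
		if (task->prio > prio)
			task->prio = prio;	/* owner inherits */
		else
			prio = task->prio;	/* owner is already higher */
	}
}
```

For the chain F->L5->B->L1->A, chain_top() returns A and chain_boost() leaves
both B and A at F's priority if they were lower.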
181 | |||
182 | |||
183 | Plist | ||
184 | ----- | ||
185 | |||
186 | Before I go further and talk about how the PI chain is stored through lists | ||
187 | on both mutexes and processes, I'll explain the plist. This is similar to | ||
188 | the struct list_head functionality that is already in the kernel. | ||
189 | The implementation of plist is out of scope for this document, but it is | ||
190 | very important to understand what it does. | ||
191 | |||
192 | There are a few differences between plist and list, the most important one | ||
193 | being that plist is a priority sorted linked list. This means that the | ||
194 | priorities of the plist are sorted, such that it takes O(1) to retrieve the | ||
195 | highest priority item in the list. Obviously this is useful to store processes | ||
196 | based on their priorities. | ||
197 | |||
198 | Another difference, which is important for implementation, is that, unlike | ||
199 | list, the head of the list is a different element than the nodes of a list. | ||
200 | So the head of the list is declared as struct plist_head and nodes that will | ||
201 | be added to the list are declared as struct plist_node. | ||
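
As a rough illustration of those two properties (priority-sorted, with
separate head and node types), here is a toy singly linked version. It is not
the kernel's plist: the toy_plist_* names are made up for this sketch, and
insertion here is a plain O(n) walk. Only the O(1) lookup of the highest
priority node is meant to match the description above.

```c
#include <stddef.h>

struct toy_plist_node {
	int prio;			/* lower number = higher priority */
	struct toy_plist_node *next;
};

/* The head is a different type from the nodes it links. */
struct toy_plist_head {
	struct toy_plist_node *first;
};

/* Insert in priority order; equal priorities keep FIFO order. */
static void toy_plist_add(struct toy_plist_node *node,
			  struct toy_plist_head *head)
{
	struct toy_plist_node **link = &head->first;

	while (*link && (*link)->prio <= node->prio)
		link = &(*link)->next;
	node->next = *link;
	*link = node;
}

/* O(1): the highest priority node is always at the front. */
static struct toy_plist_node *toy_plist_first(const struct toy_plist_head *head)
{
	return head->first;
}

/* Remove a node; does nothing if the node is not on this list. */
static void toy_plist_del(struct toy_plist_node *node,
			  struct toy_plist_head *head)
{
	struct toy_plist_node **link = &head->first;

	while (*link && *link != node)
		link = &(*link)->next;
	if (*link) {
		*link = node->next;
		node->next = NULL;
	}
}
```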
202 | |||
203 | |||
204 | Mutex Waiter List | ||
205 | ----------------- | ||
206 | |||
207 | Every mutex keeps track of all the waiters that are blocked on itself. The mutex | ||
208 | has a plist to store these waiters by priority. This list is protected by | ||
209 | a spin lock that is located in the struct of the mutex. This lock is called | ||
210 | wait_lock. Since the modification of the waiter list is never done in | ||
211 | interrupt context, the wait_lock can be taken without disabling interrupts. | ||
212 | |||
213 | |||
214 | Task PI List | ||
215 | ------------ | ||
216 | |||
217 | To keep track of the PI chains, each process has its own PI list. This is | ||
218 | a list of all top waiters of the mutexes that are owned by the process. | ||
219 | Note that this list only holds the top waiters and not all waiters that are | ||
220 | blocked on mutexes owned by the process. | ||
221 | |||
222 | The top of the task's PI list is always the highest priority task that | ||
223 | is waiting on a mutex that is owned by the task. So if the task has | ||
224 | inherited a priority, it will always be the priority of the task that is | ||
225 | at the top of this list. | ||
226 | |||
227 | This list is stored in the task structure of a process as a plist called | ||
228 | pi_list. This list is protected by a spin lock also in the task structure, | ||
229 | called pi_lock. This lock may also be taken in interrupt context, so when | ||
230 | locking the pi_lock, interrupts must be disabled. | ||
231 | |||
232 | |||
233 | Depth of the PI Chain | ||
234 | --------------------- | ||
235 | |||
236 | The maximum depth of the PI chain is not dynamic, and could actually be | ||
237 | defined. But it is very complex to figure out, since it depends on all | ||
238 | the nesting of mutexes. Let's look at the example where we have 3 mutexes, | ||
239 | L1, L2, and L3, and four separate functions func1, func2, func3 and func4. | ||
240 | The following shows a locking order of L1->L2->L3, but may not actually | ||
241 | be directly nested that way. | ||
242 | |||
243 | void func1(void) | ||
244 | { | ||
245 | 	mutex_lock(L1); | ||
246 | |||
247 | 	/* do anything */ | ||
248 | |||
249 | 	mutex_unlock(L1); | ||
250 | } | ||
251 | |||
252 | void func2(void) | ||
253 | { | ||
254 | 	mutex_lock(L1); | ||
255 | 	mutex_lock(L2); | ||
256 | |||
257 | 	/* do something */ | ||
258 | |||
259 | 	mutex_unlock(L2); | ||
260 | 	mutex_unlock(L1); | ||
261 | } | ||
262 | |||
263 | void func3(void) | ||
264 | { | ||
265 | 	mutex_lock(L2); | ||
266 | 	mutex_lock(L3); | ||
267 | |||
268 | 	/* do something else */ | ||
269 | |||
270 | 	mutex_unlock(L3); | ||
271 | 	mutex_unlock(L2); | ||
272 | } | ||
273 | |||
274 | void func4(void) | ||
275 | { | ||
276 | 	mutex_lock(L3); | ||
277 | |||
278 | 	/* do something again */ | ||
279 | |||
280 | 	mutex_unlock(L3); | ||
281 | } | ||
282 | |||
283 | Now we add 4 processes that run each of these functions separately. | ||
284 | Processes A, B, C, and D which run functions func1, func2, func3 and func4 | ||
285 | respectively, and such that D runs first and A last. With D being preempted | ||
286 | in func4 in the "do something again" area, we have a locking that follows: | ||
287 | |||
288 | D owns L3 | ||
289 | C blocked on L3 | ||
290 | C owns L2 | ||
291 | B blocked on L2 | ||
292 | B owns L1 | ||
293 | A blocked on L1 | ||
294 | |||
295 | And thus we have the chain A->L1->B->L2->C->L3->D. | ||
296 | |||
297 | This gives us a PI depth of 4 (four processes), but looking at any of the | ||
298 | functions individually, it seems as though they only have at most a locking | ||
299 | depth of two. So, although the locking depth is defined at compile time, | ||
300 | it still is very difficult to find the possibilities of that depth. | ||
301 | |||
302 | Now since mutexes can be defined by user-land applications, we don't want a DoS | ||
303 | type of application that nests large amounts of mutexes to create a large | ||
304 | PI chain, and have the code holding spin locks while looking at a large | ||
305 | amount of data. So to prevent this, the implementation not only implements | ||
306 | a maximum lock depth, but also only holds at most two different locks at a | ||
307 | time, as it walks the PI chain. More about this below. | ||
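
The depth of a chain, and the idea of refusing to walk past a limit, can be
sketched as follows. The dep_task and dep_mutex structures and the
max_depth parameter are assumptions made for this example. The real limit and
how the kernel reacts when it is exceeded are not shown here.

```c
#include <stddef.h>

struct dep_mutex;

struct dep_task {
	struct dep_mutex *blocked_on;	/* mutex this task waits for, or NULL */
};

struct dep_mutex {
	struct dep_task *owner;		/* current owner, or NULL */
};

/*
 * Count the processes on the chain starting at 'task'. Returns -1 as soon
 * as the walk would go beyond max_depth processes, so a hostile nesting
 * of mutexes cannot make the walk arbitrarily long.
 */
static int dep_chain_depth(const struct dep_task *task, int max_depth)
{
	int depth = 1;

	while (task->blocked_on && task->blocked_on->owner) {
		if (depth >= max_depth)
			return -1;
		task = task->blocked_on->owner;
		depth++;
	}
	return depth;
}
```

For the chain A->L1->B->L2->C->L3->D this returns 4 when the limit allows it,
and -1 when called with a limit of 3.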
308 | |||
309 | |||
310 | Mutex owner and flags | ||
311 | --------------------- | ||
312 | |||
313 | The mutex structure contains a pointer to the owner of the mutex. If the | ||
314 | mutex is not owned, this owner is set to NULL. Since all architectures | ||
315 | have the task structure on at least a four byte alignment (and if this is | ||
316 | not true, the rtmutex.c code will be broken!), this allows for the two | ||
317 | least significant bits to be used as flags. This part is also described | ||
318 | in Documentation/rt-mutex.txt, but will also be briefly described here. | ||
319 | |||
320 | Bit 0 is used as the "Pending Owner" flag. This is described later. | ||
321 | Bit 1 is used as the "Has Waiters" flag. This is also described later | ||
322 | in more detail, but is set whenever there are waiters on a mutex. | ||
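
Packing flags into the low bits of an aligned pointer can be sketched like
this. The OFIELD_* macros, ofield_* helpers and the ofield_owner structure are
hypothetical names for illustration. The bit assignment follows the text
above (bit 0 "Pending Owner", bit 1 "Has Waiters").

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OFIELD_PENDING_OWNER	((uintptr_t)1)	/* bit 0 */
#define OFIELD_HAS_WAITERS	((uintptr_t)2)	/* bit 1 */
#define OFIELD_FLAG_MASK	((uintptr_t)3)

/* Stand-in for the task structure; needs at least 4-byte alignment. */
struct ofield_owner {
	_Alignas(4) int pid;
};

/* Combine an owner pointer and flag bits into one word. */
static uintptr_t ofield_encode(struct ofield_owner *owner, uintptr_t flags)
{
	return (uintptr_t)owner | (flags & OFIELD_FLAG_MASK);
}

/* Strip the flag bits to get the owner pointer back (NULL if unowned). */
static struct ofield_owner *ofield_task(uintptr_t word)
{
	return (struct ofield_owner *)(word & ~OFIELD_FLAG_MASK);
}

static bool ofield_is_pending(uintptr_t word)
{
	return (word & OFIELD_PENDING_OWNER) != 0;
}

static bool ofield_has_waiters(uintptr_t word)
{
	return (word & OFIELD_HAS_WAITERS) != 0;
}
```

The trick only works because the two low bits of a sufficiently aligned
pointer are always zero, which is the alignment requirement mentioned above.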
323 | |||
324 | |||
325 | cmpxchg Tricks | ||
326 | -------------- | ||
327 | |||
328 | Some architectures implement an atomic cmpxchg (Compare and Exchange). This | ||
329 | is used (when applicable) to keep the fast path of grabbing and releasing | ||
330 | mutexes short. | ||
331 | |||
332 | cmpxchg is basically the following function performed atomically: | ||
333 | |||
334 | unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C) | ||
335 | { | ||
336 | 	unsigned long T = *A; | ||
337 | 	if (*A == *B) { | ||
338 | 		*A = *C; | ||
339 | 	} | ||
340 | 	return T; | ||
341 | } | ||
342 | #define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c) | ||
343 | |||
344 | This is really nice to have, since it allows you to only update a variable | ||
345 | if the variable is what you expect it to be. You know if it succeeded if | ||
346 | the return value (the old value of A) is equal to B. | ||
347 | |||
348 | The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If | ||
349 | the architecture does not support CMPXCHG, then this macro is simply set | ||
350 | to fail every time. But if CMPXCHG is supported, then this will | ||
351 | help out extremely to keep the fast path short. | ||
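
A userspace sketch of such a fast path, using the C11 compare-and-exchange
rather than the kernel's rt_mutex_cmpxchg, might look like this. The
fast_mutex structure and both helpers are hypothetical. Note that the C11
call returns a bool and writes the old value into "expected", instead of
returning the old value as the function above does.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct fast_mutex {
	_Atomic uintptr_t owner;	/* 0 means "not owned" */
};

/* Fast lock: succeeds only if the owner field is exactly 0. */
static bool fast_trylock(struct fast_mutex *m, uintptr_t me)
{
	uintptr_t expected = 0;

	return atomic_compare_exchange_strong(&m->owner, &expected, me);
}

/*
 * Fast unlock: succeeds only if the owner field is exactly 'me', that is,
 * we own the mutex and no flag bits have been set in the meantime.
 */
static bool fast_unlock(struct fast_mutex *m, uintptr_t me)
{
	uintptr_t expected = me;

	return atomic_compare_exchange_strong(&m->owner, &expected, 0);
}
```

When either call fails, a real implementation would fall back to a slow path
that takes a spin lock; that part is omitted here.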
352 | |||
353 | The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize | ||
354 | the system for architectures that support it. This will also be explained | ||
355 | later in this document. | ||
356 | |||
357 | |||
358 | Priority adjustments | ||
359 | -------------------- | ||
360 | |||
361 | The implementation of the PI code in rtmutex.c has several places where a | ||
362 | process must adjust its priority. With the help of the pi_list of a | ||
363 | process, it is rather easy to know what needs to be adjusted. | ||
364 | |||
365 | The functions implementing the task adjustments are rt_mutex_adjust_prio, | ||
366 | __rt_mutex_adjust_prio (same as the former, but expects the task pi_lock | ||
367 | to already be taken), rt_mutex_getprio, and rt_mutex_setprio. | ||
368 | |||
369 | rt_mutex_getprio and rt_mutex_setprio are only used in __rt_mutex_adjust_prio. | ||
370 | |||
371 | rt_mutex_getprio returns the priority that the task should have. Either the | ||
372 | task's own normal priority, or if a process of a higher priority is waiting on | ||
373 | a mutex owned by the task, then that higher priority should be returned. | ||
374 | Since the pi_list of a task holds a priority-ordered list of all the top | ||
375 | waiters of all the mutexes that the task owns, rt_mutex_getprio simply needs | ||
376 | to compare the top pi waiter to its own normal priority, and return the higher | ||
377 | priority back. | ||
378 | |||
379 | (Note: if looking at the code, you will notice that the lower number of | ||
380 | prio is returned. This is because the prio field in the task structure | ||
381 | is an inverse order of the actual priority. So a "prio" of 5 is | ||
382 | of higher priority than a "prio" of 10.) | ||
383 | |||
384 | __rt_mutex_adjust_prio examines the result of rt_mutex_getprio, and if the | ||
385 | result does not equal the task's current priority, then rt_mutex_setprio | ||
386 | is called to adjust the priority of the task to the new priority. | ||
387 | Note that rt_mutex_setprio is defined in kernel/sched/core.c to implement the | ||
388 | actual change in priority. | ||
389 | |||
390 | It is interesting to note that __rt_mutex_adjust_prio can either increase | ||
391 | or decrease the priority of the task. In the case that a higher priority | ||
392 | process has just blocked on a mutex owned by the task, __rt_mutex_adjust_prio | ||
393 | would increase/boost the task's priority. But if a higher priority task | ||
394 | were for some reason to leave the mutex (timeout or signal), this same function | ||
395 | would decrease/unboost the priority of the task. That is because the pi_list | ||
396 | always contains the highest priority task that is waiting on a mutex owned | ||
397 | by the task, so we only need to compare the priority of that top pi waiter | ||
398 | to the normal priority of the given task. | ||
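
The boost and unboost behaviour can be condensed into a few lines. This is a
simplified model, not the code in rtmutex.c: adj_task, adj_getprio and
adj_adjust_prio are hypothetical stand-ins, the pi_list is reduced to the
priority of its top entry, and no locking is shown.

```c
#include <stdbool.h>

struct adj_task {
	int normal_prio;		/* lower number = higher priority */
	int prio;			/* priority the task currently runs at */
	bool has_pi_waiters;		/* is the pi_list non-empty? */
	int top_pi_waiter_prio;		/* valid only if has_pi_waiters */
};

/* The priority the task should have: the better of its own and the top waiter's. */
static int adj_getprio(const struct adj_task *t)
{
	if (!t->has_pi_waiters)
		return t->normal_prio;
	return t->top_pi_waiter_prio < t->normal_prio ?
	       t->top_pi_waiter_prio : t->normal_prio;
}

/* Boost or unboost, depending on what the top pi waiter currently is. */
static void adj_adjust_prio(struct adj_task *t)
{
	int prio = adj_getprio(t);

	if (t->prio != prio)
		t->prio = prio;		/* stands in for the real priority change */
}
```

The same function raises the priority when a higher priority waiter arrives
and lowers it again when that waiter goes away.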
399 | |||
400 | |||
401 | High level overview of the PI chain walk | ||
402 | ---------------------------------------- | ||
403 | |||
404 | The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain. | ||
405 | |||
406 | The implementation has gone through several iterations, and has ended up | ||
407 | with what we believe is the best. It walks the PI chain by only grabbing | ||
408 | at most two locks at a time, and is very efficient. | ||
409 | |||
410 | The rt_mutex_adjust_prio_chain can be used either to boost or lower process | ||
411 | priorities. | ||
412 | |||
413 | rt_mutex_adjust_prio_chain is called with a task to be checked for PI | ||
414 | (de)boosting (the owner of a mutex that a process is blocking on), a flag to | ||
415 | check for deadlocking, the mutex that the task owns, and a pointer to a waiter | ||
416 | that is the process's waiter struct that is blocked on the mutex (although this | ||
417 | parameter may be NULL for deboosting). | ||
418 | |||
419 | For this explanation, I will not mention deadlock detection. This explanation | ||
420 | will try to stay at a high level. | ||
421 | |||
422 | When this function is called, there are no locks held. That also means | ||
423 | that the state of the owner and lock can change once this function is entered. | ||
424 | |||
425 | Before this function is called, the task has already had rt_mutex_adjust_prio | ||
426 | performed on it. This means that the task is set to the priority that it | ||
427 | should be at, but the plist nodes of the task's waiter have not been updated | ||
428 | with the new priorities, and that this task may not be in the proper locations | ||
429 | in the pi_lists and wait_lists that the task is blocked on. This function | ||
430 | solves all that. | ||
431 | |||
432 | A loop is entered, where task is the owner to be checked for PI changes that | ||
433 | was passed by parameter (for the first iteration). The pi_lock of this task is | ||
434 | taken to prevent any more changes to the pi_list of the task. This also | ||
435 | prevents new tasks from completing the blocking on a mutex that is owned by this | ||
436 | task. | ||
437 | |||
438 | If the task is not blocked on a mutex then the loop is exited. We are at | ||
439 | the top of the PI chain. | ||
440 | |||
441 | A check is now done to see if the original waiter (the process that is blocked | ||
442 | on the current mutex) is the top pi waiter of the task. That is, is this | ||
443 | waiter on the top of the task's pi_list. If it is not, it either means that | ||
444 | there is another process higher in priority that is blocked on one of the | ||
445 | mutexes that the task owns, or that the waiter has just woken up via a signal | ||
446 | or timeout and has left the PI chain. In either case, the loop is exited, since | ||
447 | we don't need to do any more changes to the priority of the current task, or any | ||
448 | task that owns a mutex that this current task is waiting on. A priority chain | ||
449 | walk is only needed when a new top pi waiter is made to a task. | ||
450 | |||
451 | The next check sees if the task's waiter plist node has the priority equal to | ||
452 | the priority the task is set at. If they are equal, then we are done with | ||
453 | the loop. Remember that the function started with the priority of the | ||
454 | task adjusted, but the plist nodes that hold the task in other processes' | ||
455 | pi_lists have not been adjusted. | ||
456 | |||
457 | Next, we look at the mutex that the task is blocked on. The mutex's wait_lock | ||
458 | is taken. This is done by a spin_trylock, because the locking order of the | ||
459 | pi_lock and wait_lock goes in the opposite direction. If we fail to grab the | ||
460 | lock, the pi_lock is released, and we restart the loop. | ||
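
The "take one lock, trylock the other, back off on failure" pattern can be
sketched with toy spin locks built on atomic_flag. Everything here is
hypothetical: the toy_spin_* helpers are not the kernel's spin lock API, and
the max_retries bound exists only so the sketch can be exercised from a
single thread, whereas the description above simply restarts the loop.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_spinlock {
	atomic_flag held;
};

static void toy_spin_lock(struct toy_spinlock *l)
{
	while (atomic_flag_test_and_set(&l->held))
		;	/* busy-wait until the holder clears the flag */
}

static bool toy_spin_trylock(struct toy_spinlock *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Take pi_lock, then try wait_lock against the usual locking order.
 * On failure, drop pi_lock and start over instead of waiting, which
 * avoids deadlocking with a path that takes the locks the other way.
 * Returns true with both locks held, false with neither held.
 */
static bool toy_lock_pi_then_wait(struct toy_spinlock *pi_lock,
				  struct toy_spinlock *wait_lock,
				  int max_retries)
{
	for (int i = 0; i <= max_retries; i++) {
		toy_spin_lock(pi_lock);
		if (toy_spin_trylock(wait_lock))
			return true;
		toy_spin_unlock(pi_lock);
	}
	return false;
}
```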
461 | |||
462 | Now that we have both the pi_lock of the task as well as the wait_lock of | ||
463 | the mutex the task is blocked on, we update the task's waiter's plist node | ||
464 | that is located on the mutex's wait_list. | ||
465 | |||
466 | Now we release the pi_lock of the task. | ||
467 | |||
468 | Next the owner of the mutex has its pi_lock taken, so we can update the | ||
469 | task's entry in the owner's pi_list. If the task is the highest priority | ||
470 | process on the mutex's wait_list, then we remove the previous top waiter | ||
471 | from the owner's pi_list, and replace it with the task. | ||
472 | |||
473 | Note: It is possible that the task was the current top waiter on the mutex, | ||
474 | in which case the task is not yet on the pi_list of the waiter. This | ||
475 | is OK, since plist_del does nothing if the plist node is not on any | ||
476 | list. | ||
477 | |||
478 | If the task was not the top waiter of the mutex, but it was before we | ||
479 | did the priority updates, that means we are deboosting/lowering the | ||
480 | task. In this case, the task is removed from the pi_list of the owner, | ||
481 | and the new top waiter is added. | ||
482 | |||
483 | Lastly, we unlock both the pi_lock of the task, as well as the mutex's | ||
484 | wait_lock, and continue the loop again. On the next iteration of the | ||
485 | loop, the previous owner of the mutex will be the task that will be | ||
486 | processed. | ||
487 | |||
488 | Note: One might think that the owner of this mutex might have changed | ||
489 | since we just grabbed the mutex's wait_lock. And one would be right. | ||
490 | The important thing to remember is that the owner could not have | ||
491 | become the task that is being processed in the PI chain, since | ||
492 | we have taken that task's pi_lock at the beginning of the loop. | ||
493 | So as long as there is an owner of this mutex that is not the same | ||
494 | process as the task being worked on, we are OK. | ||
495 | |||
496 | Looking closely at the code, one might be confused. The check for the | ||
497 | end of the PI chain is when the task isn't blocked on anything or the | ||
498 | task's waiter structure "task" element is NULL. This check is | ||
499 | protected only by the task's pi_lock. But the code to unlock the mutex | ||
500 | sets the task's waiter structure "task" element to NULL with only | ||
501 | the protection of the mutex's wait_lock, which was not taken yet. | ||
502 | Isn't this a race condition if the task becomes the new owner? | ||
503 | |||
504 | The answer is No! The trick is the spin_trylock of the mutex's | ||
505 | wait_lock. If we fail that lock, we release the pi_lock of the | ||
506 | task and continue the loop, doing the end of PI chain check again. | ||
507 | |||
508 | In the code to release the lock, the wait_lock of the mutex is held | ||
509 | the entire time, and it is not let go when we grab the pi_lock of the | ||
510 | new owner of the mutex. So if the switch of a new owner were to happen | ||
511 | after the check for end of the PI chain and the grabbing of the | ||
512 | wait_lock, the unlocking code would spin on the new owner's pi_lock | ||
513 | but never give up the wait_lock. So the PI chain loop is guaranteed to | ||
514 | fail the spin_trylock on the wait_lock, release the pi_lock, and | ||
515 | try again. | ||
516 | |||
517 | If you don't quite understand the above, that's OK. You don't have to, | ||
518 | unless you really want to make a proof out of it ;) | ||
519 | |||
520 | |||
521 | Pending Owners and Lock stealing | ||
522 | -------------------------------- | ||
523 | |||
524 | One of the flags in the owner field of the mutex structure is "Pending Owner". | ||
525 | What this means is that an owner was chosen by the process releasing the | ||
526 | mutex, but that owner has yet to wake up and actually take the mutex. | ||
527 | |||
528 | Why is this important? Why can't we just give the mutex to another process | ||
529 | and be done with it? | ||
530 | |||
531 | The PI code is to help with real-time processes, and to let the highest | ||
532 | priority process run as long as possible with little latencies and delays. | ||
533 | If a high priority process owns a mutex that a lower priority process is | ||
534 | blocked on, when the mutex is released it would be given to the lower priority | ||
535 | process. What if the higher priority process wants to take that mutex again? | ||
536 | The high priority process would fail to take that mutex that it just gave up | ||
537 | and it would need to boost the lower priority process to run with full | ||
538 | latency of that critical section (since the low priority process just entered | ||
539 | it). | ||
540 | |||
541 | There's no reason a high priority process that gives up a mutex should be | ||
542 | penalized if it tries to take that mutex again. If the new owner of the | ||
543 | mutex has not woken up yet, there's no reason that the higher priority process | ||
544 | could not take that mutex away. | ||
545 | |||
546 | To solve this, we introduced Pending Ownership and Lock Stealing. When a | ||
547 | new process is given a mutex that it was blocked on, it is only given | ||
548 | pending ownership. This means that it's the new owner, unless a higher | ||
549 | priority process comes in and tries to grab that mutex. If a higher priority | ||
550 | process does come along and wants that mutex, we let the higher priority | ||
551 | process "steal" the mutex from the pending owner (only if it is still pending) | ||
552 | and continue with the mutex. | ||
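
The stealing rule can be written down as a small decision function. This is a
sketch under simplifying assumptions: steal_task, steal_mutex and both helpers
are hypothetical, the pending state is a plain bool rather than a bit in the
owner field, no locking is shown, and a lower number means a higher priority.

```c
#include <stdbool.h>
#include <stddef.h>

struct steal_task {
	int prio;			/* lower number = higher priority */
};

struct steal_mutex {
	struct steal_task *owner;	/* NULL when not owned */
	bool pending;			/* owner was chosen but has not run yet */
};

/* The releasing task hands the mutex to 'next' as a pending owner only. */
static void steal_release_to(struct steal_mutex *m, struct steal_task *next)
{
	m->owner = next;
	m->pending = (next != NULL);
}

/*
 * Try to take the mutex. This succeeds if the mutex is free, if 'me' is
 * the pending owner finally waking up, or if the owner is still pending
 * and 'me' is of strictly higher priority (the "steal").
 */
static bool steal_try_take(struct steal_mutex *m, struct steal_task *me)
{
	bool can_take = m->owner == NULL || m->owner == me ||
			(m->pending && me->prio < m->owner->prio);

	if (!can_take)
		return false;
	m->owner = me;
	m->pending = false;
	return true;
}
```

Once the pending owner has actually taken the mutex, the pending state is
gone and a later higher priority task has to block like any other waiter.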
553 | |||
554 | |||
555 | Taking of a mutex (The walk through) | ||
556 | ------------------------------------ | ||
557 | |||
558 | OK, now let's take a look at the detailed walk through of what happens when | ||
559 | taking a mutex. | ||
560 | |||
561 | The first thing that is tried is the fast taking of the mutex. This is | ||
562 | done when we have CMPXCHG enabled (otherwise the fast taking automatically | ||
563 | fails). Only when the owner field of the mutex is NULL can the lock be | ||
564 | taken with the CMPXCHG and nothing else needs to be done. | ||
565 | |||
566 | If there is contention on the lock, whether it is owned or has a pending owner, | ||
567 | we take the slow path (rt_mutex_slowlock). | ||
568 | |||
569 | The slow path function is where the task's waiter structure is created on | ||
570 | the stack. This is because the waiter structure is only needed for the | ||
571 | scope of this function. The waiter structure holds the nodes to store | ||
572 | the task on the wait_list of the mutex, and if need be, the pi_list of | ||
573 | the owner. | ||
574 | |||
575 | The wait_lock of the mutex is taken since the slow path of unlocking the | ||
576 | mutex also takes this lock. | ||
577 | |||
578 | We then call try_to_take_rt_mutex. This is where an architecture that | ||
579 | does not implement CMPXCHG always grabs the lock (if there's no | ||
580 | contention). | ||
581 | |||
582 | try_to_take_rt_mutex is used every time the task tries to grab a mutex in the | ||
583 | slow path. The first thing that is done here is an atomic setting of | ||
584 | the "Has Waiters" flag of the mutex's owner field. Yes, this could really | ||
585 | be false, because if the mutex has no owner, there are no waiters and | ||
586 | the current task also won't have any waiters. But we don't have the lock | ||
587 | yet, so we assume we are going to be a waiter. The reason for this is to | ||
588 | play nice for those architectures that do have CMPXCHG. By setting this flag | ||
589 | now, the owner of the mutex can't release the mutex without going into the | ||
590 | slow unlock path, and it would then need to grab the wait_lock, which this | ||
591 | code currently holds. So setting the "Has Waiters" flag forces the owner | ||
592 | to synchronize with this code. | ||
593 | |||
594 | Now that we know that we can't have any races with the owner releasing the | ||
595 | mutex, we check to see if we can take the ownership. This is done if the | ||
596 | mutex doesn't have an owner, or if we can steal the mutex from a pending | ||
597 | owner. Let's look at the situations we have here. | ||
598 | |||
599 | 1) Has owner that is pending | ||
600 | ---------------------------- | ||
601 | |||
602 | The mutex has an owner, but it hasn't woken up and the mutex flag | ||
603 | "Pending Owner" is set. The first check is to see if the owner isn't the | ||
604 | current task. This is because this function is also used for the pending | ||
605 | owner to grab the mutex. When a pending owner wakes up, it checks to see | ||
606 | if it can take the mutex, and this is done if the owner is already set to | ||
607 | itself. If so, we succeed and leave the function, clearing the "Pending | ||
608 | Owner" bit. | ||
609 | |||
610 | If the pending owner is not current, we check to see if the current priority is | ||
611 | higher than the pending owner. If not, we fail the function and return. | ||
612 | |||
613 | There's also something special about a pending owner. That is, a pending owner | ||
614 | is never blocked on a mutex. So there is no PI chain to worry about. It also | ||
615 | means that if the mutex doesn't have any waiters, there's no accounting needed | ||
616 | to update the pending owner's pi_list, since we only worry about processes | ||
617 | blocked on the current mutex. | ||
618 | |||
619 | If there are waiters on this mutex, and we just stole the ownership, we need | ||
620 | to take the top waiter, remove it from the pi_list of the pending owner, and | ||
621 | add it to the current pi_list. Note that at this moment, the pending owner | ||
622 | is no longer on the list of waiters. This is fine, since the pending owner | ||
623 | would add itself back when it realizes that it had the ownership stolen | ||
624 | from itself. When the pending owner tries to grab the mutex, it will fail | ||
625 | in try_to_take_rt_mutex if the owner field points to another process. | ||
626 | |||
627 | 2) No owner | ||
628 | ----------- | ||
629 | |||
630 | If there is no owner (or we successfully stole the lock), we set the owner | ||
631 | of the mutex to current, and set the flag of "Has Waiters" if the current | ||
632 | mutex actually has waiters, or we clear the flag if it doesn't. See, it was | ||
633 | OK that we set that flag early, since now it is cleared. | ||
634 | |||
635 | 3) Failed to grab ownership | ||
636 | --------------------------- | ||
637 | |||
638 | The most interesting case is when we fail to take ownership. This means that | ||
639 | there exists an owner, or there's a pending owner with equal or higher | ||
640 | priority than the current task. | ||
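The three situations above can be condensed into a single predicate. This is
only a sketch of the decision, not the kernel's try_to_take_rt_mutex: the
names struct task and can_take_ownership are hypothetical, the waiter and
pi_list accounting is left out, and it assumes that a lower prio value means
a higher priority.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task; assumption: lower prio value = higher priority. */
struct task {
	int prio;
};

/*
 * Decide whether "self" may take ownership of a mutex whose owner field
 * currently names "owner" (NULL if none) with the "Pending Owner" flag
 * given by owner_pending.
 */
static bool can_take_ownership(const struct task *owner, bool owner_pending,
			       const struct task *self)
{
	assert(self != NULL);

	if (!owner)
		return true;		/* 2) no owner */
	if (!owner_pending)
		return false;		/* 3) a real owner holds the mutex */
	if (owner == self)
		return true;		/* 1) pending owner claims its mutex */

	/* 1) steal only with strictly higher priority than the pending owner */
	return self->prio < owner->prio;
}
```

A task of equal priority does not steal, which matches the failure condition
of a pending owner with equal or higher priority than the current task.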
641 | |||
642 | We'll continue on the failed case. | ||
643 | |||
644 | If the mutex has a timeout, we set up a timer to go off to break us out | ||
645 | of this mutex if we failed to get it after a specified amount of time. | ||
646 | |||
647 | Now we enter a loop that will continue to try to take ownership of the mutex, or | ||
648 | fail from a timeout or signal. | ||
649 | |||
650 | Once again we try to take the mutex. This will usually fail the first time | ||
651 | in the loop, since it had just failed to get the mutex. But the second time | ||
652 | in the loop, this would likely succeed, since the task would likely be | ||
653 | the pending owner. | ||
654 | |||
655 | If the mutex is TASK_INTERRUPTIBLE a check for signals and timeout is done | ||
656 | here. | ||
657 | |||
658 | The waiter structure has a "task" field that points to the task that is blocked | ||
659 | on the mutex. This field can be NULL the first time it goes through the loop | ||
660 | or if the task is a pending owner and had its mutex stolen. If the "task" | ||
661 | field is NULL then we need to set up the accounting for it. | ||
662 | |||
663 | Task blocks on mutex | ||
664 | -------------------- | ||
665 | |||
666 | The accounting of a mutex and process is done with the waiter structure of | ||
667 | the process. The "task" field is set to the process, and the "lock" field | ||
668 | to the mutex. The plist nodes are initialized to the process's current | ||
669 | priority. | ||
670 | |||
671 | Since the wait_lock was taken at the entry of the slow lock, we can safely | ||
672 | add the waiter to the wait_list. If the current process is the highest | ||
673 | priority process currently waiting on this mutex, then we remove the | ||
674 | previous top waiter process (if it exists) from the pi_list of the owner, | ||
675 | and add the current process to that list. Since the pi_list of the owner | ||
676 | has changed, we call rt_mutex_adjust_prio on the owner to see if the owner | ||
677 | should adjust its priority accordingly. | ||
678 | |||
679 | If the owner is also blocked on a lock, and had its pi_list changed | ||
680 | (or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead | ||
681 | and run rt_mutex_adjust_prio_chain on the owner, as described earlier. | ||
682 | |||
683 | Now all locks are released, and if the current process is still blocked on a | ||
684 | mutex (waiter "task" field is not NULL), then we go to sleep (call schedule). | ||
685 | |||
686 | Waking up in the loop | ||
687 | --------------------- | ||
688 | |||
689 | The schedule can then wake up for a few reasons. | ||
690 | 1) we were given pending ownership of the mutex. | ||
691 | 2) we received a signal and were TASK_INTERRUPTIBLE | ||
692 | 3) we had a timeout and were TASK_INTERRUPTIBLE | ||
693 | |||
694 | In any of these cases, we continue the loop and once again try to grab the | ||
695 | ownership of the mutex. If we succeed, we exit the loop. Otherwise, on a | ||
696 | signal or timeout we exit the loop, and if the mutex was stolen from us we | ||
697 | simply add ourselves back on the lists and go back to sleep. | ||
698 | |||
699 | Note: For various reasons, because of timeout and signals, the steal mutex | ||
700 | algorithm needs to be careful. This is because the current process is | ||
701 | still on the wait_list. And because of dynamic changing of priorities, | ||
702 | especially on SCHED_OTHER tasks, the current process can be the | ||
703 | highest priority task on the wait_list. | ||
704 | |||
705 | Failed to get mutex on Timeout or Signal | ||
706 | ---------------------------------------- | ||
707 | |||
708 | If a timeout or signal occurred, the waiter's "task" field would not be | ||
709 | NULL and the task needs to be taken off the wait_list of the mutex and perhaps | ||
710 | pi_list of the owner. If this process was a high priority process, then | ||
711 | the rt_mutex_adjust_prio_chain needs to be executed again on the owner, | ||
712 | but this time it will be lowering the priorities. | ||
713 | |||
714 | |||
715 | Unlocking the Mutex | ||
716 | ------------------- | ||
717 | |||
718 | The unlocking of a mutex also has a fast path for those architectures with | ||
719 | CMPXCHG. Since the taking of a mutex on contention always sets the | ||
720 | "Has Waiters" flag of the mutex's owner, we use this to know if we need to | ||
721 | take the slow path when unlocking the mutex. If the mutex doesn't have any | ||
722 | waiters, the owner field of the mutex would equal the current process and | ||
723 | the mutex can be unlocked by just replacing the owner field with NULL. | ||
724 | |||
725 | If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available), | ||
726 | the slow unlock path is taken. | ||
727 | |||
728 | The first thing done in the slow unlock path is to take the wait_lock of the | ||
729 | mutex. This synchronizes the locking and unlocking of the mutex. | ||
730 | |||
731 | A check is made to see if the mutex has waiters or not. On architectures that | ||
732 | do not have CMPXCHG, this is the location that the owner of the mutex will | ||
733 | determine if a waiter needs to be awoken or not. On architectures that | ||
734 | do have CMPXCHG, that check is done in the fast path, but it is still needed | ||
735 | in the slow path too. If a waiter of a mutex woke up because of a signal | ||
736 | or timeout between the time the owner failed the fast path CMPXCHG check and | ||
737 | the grabbing of the wait_lock, the mutex may not have any waiters, thus the | ||
738 | owner still needs to make this check. If there are no waiters then the mutex | ||
739 | owner field is set to NULL, the wait_lock is released and nothing more is | ||
740 | needed. | ||
741 | |||
742 | If there are waiters, then we need to wake one up and give that waiter | ||
743 | pending ownership. | ||
744 | |||
745 | On the wake up code, the pi_lock of the current owner is taken. The top | ||
746 | waiter of the lock is found and removed from the wait_list of the mutex | ||
747 | as well as the pi_list of the current owner. The task field of the new | ||
748 | pending owner's waiter structure is set to NULL, and the owner field of the | ||
749 | mutex is set to the new owner with the "Pending Owner" bit set, as well | ||
750 | as the "Has Waiters" bit if there still are other processes blocked on the | ||
751 | mutex. | ||
752 | |||
753 | The pi_lock of the previous owner is released, and the new pending owner's | ||
754 | pi_lock is taken. Remember that this is the trick to prevent the race | ||
755 | condition in rt_mutex_adjust_prio_chain from adding itself as a waiter | ||
756 | on the mutex. | ||
757 | |||
758 | We now clear the "pi_blocked_on" field of the new pending owner, and if | ||
759 | the mutex still has waiters pending, we add the new top waiter to the pi_list | ||
760 | of the pending owner. | ||
761 | |||
762 | Finally we unlock the pi_lock of the pending owner and wake it up. | ||
763 | |||
764 | |||
765 | Contact | ||
766 | ------- | ||
767 | |||
768 | For updates on this document, please email Steven Rostedt <rostedt@goodmis.org> | ||
769 | |||
770 | |||
771 | Credits | ||
772 | ------- | ||
773 | |||
774 | Author: Steven Rostedt <rostedt@goodmis.org> | ||
775 | |||
776 | Reviewers: Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and Randy Dunlap | ||
777 | |||
778 | Updates | ||
779 | ------- | ||
780 | |||
781 | This document was originally written for 2.6.17-rc3-mm1 | ||
diff --git a/Documentation/locking/rt-mutex.txt b/Documentation/locking/rt-mutex.txt
new file mode 100644
index 000000000000..243393d882ee
--- /dev/null
+++ b/Documentation/locking/rt-mutex.txt
@@ -0,0 +1,79 @@ | |||
1 | RT-mutex subsystem with PI support | ||
2 | ---------------------------------- | ||
3 | |||
4 | RT-mutexes with priority inheritance are used to support PI-futexes, | ||
5 | which enable pthread_mutex_t priority inheritance attributes | ||
6 | (PTHREAD_PRIO_INHERIT). [See Documentation/pi-futex.txt for more details | ||
7 | about PI-futexes.] | ||
8 | |||
9 | This technology was developed in the -rt tree and streamlined for | ||
10 | pthread_mutex support. | ||
11 | |||
12 | Basic principles: | ||
13 | ----------------- | ||
14 | |||
15 | RT-mutexes extend the semantics of simple mutexes by the priority | ||
16 | inheritance protocol. | ||
17 | |||
18 | A low priority owner of a rt-mutex inherits the priority of a higher | ||
19 | priority waiter until the rt-mutex is released. If the temporarily | ||
20 | boosted owner blocks on a rt-mutex itself it propagates the priority | ||
21 | boosting to the owner of the other rt_mutex it gets blocked on. The | ||
22 | priority boosting is immediately removed once the rt_mutex has been | ||
23 | unlocked. | ||
24 | |||
25 | This approach allows us to shorten the block of high-prio tasks on | ||
26 | mutexes which protect shared resources. Priority inheritance is not a | ||
27 | magic bullet for poorly designed applications, but it allows | ||
28 | well-designed applications to use userspace locks in critical parts of | ||
29 | a high priority thread, without losing determinism. | ||
30 | |||
31 | The enqueueing of the waiters into the rtmutex waiter list is done in | ||
32 | priority order. For same priorities FIFO order is chosen. For each | ||
33 | rtmutex, only the top priority waiter is enqueued into the owner's | ||
34 | priority waiters list. This list too queues in priority order. Whenever | ||
35 | the top priority waiter of a task changes (for example it timed out or | ||
36 | got a signal), the priority of the owner task is readjusted. [The | ||
37 | priority enqueueing is handled by "plists", see include/linux/plist.h | ||
38 | for more details.] | ||
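The queueing rule above (priority order, FIFO among equal priorities) can be
illustrated with a plain singly linked list. This is not the plist
implementation from include/linux/plist.h; struct waiter and waiter_enqueue
are hypothetical names, and a lower prio value is assumed to mean a higher
priority.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical waiter; assumption: lower prio value = higher priority. */
struct waiter {
	int prio;
	struct waiter *next;
};

/*
 * Insert w so that the list stays sorted by priority. Waiters of equal
 * priority are skipped over, so a new waiter lands behind them (FIFO).
 */
static void waiter_enqueue(struct waiter **head, struct waiter *w)
{
	struct waiter **pp = head;

	assert(head != NULL && w != NULL);

	while (*pp && (*pp)->prio <= w->prio)
		pp = &(*pp)->next;

	w->next = *pp;
	*pp = w;
}
```

The head of such a list is always the top priority waiter, which is the only
one that would be enqueued into the owner's priority waiters list.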
39 | |||
40 | RT-mutexes are optimized for fastpath operations and have no internal | ||
41 | locking overhead when locking an uncontended mutex or unlocking a mutex | ||
42 | without waiters. The optimized fastpath operations require cmpxchg | ||
43 | support. [If that is not available then the rt-mutex internal spinlock | ||
44 | is used] | ||
45 | |||
46 | The state of the rt-mutex is tracked via the owner field of the rt-mutex | ||
47 | structure: | ||
48 | |||
49 | rt_mutex->owner holds the task_struct pointer of the owner. Bit 0 and 1 | ||
50 | are used to keep track of the "owner is pending" and "rtmutex has | ||
51 | waiters" state. | ||
52 | |||
53 | owner bit1 bit0 | ||
54 | NULL 0 0 mutex is free (fast acquire possible) | ||
55 | NULL 0 1 invalid state | ||
56 | NULL 1 0 Transitional state* | ||
57 | NULL 1 1 invalid state | ||
58 | taskpointer 0 0 mutex is held (fast release possible) | ||
59 | taskpointer 0 1 task is pending owner | ||
60 | taskpointer 1 0 mutex is held and has waiters | ||
61 | taskpointer 1 1 task is pending owner and mutex has waiters | ||
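The table above can be read as a pointer with two flag bits packed into its
low bits. The following sketch shows one way to encode and decode such a
value in plain C; the rtm_* helpers and RTM_* constants are hypothetical
names chosen for illustration, not the kernel's own macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RTM_OWNER_PENDING 1UL	/* bit 0: "owner is pending" */
#define RTM_HAS_WAITERS   2UL	/* bit 1: "rtmutex has waiters" */
#define RTM_FLAG_MASK     3UL

/* Hypothetical stand-in for task_struct; aligned to at least 4 bytes. */
struct task {
	int prio;
};

/* Combine a task pointer and flag bits into one owner word. */
static uintptr_t rtm_pack(struct task *t, unsigned long flags)
{
	assert(((uintptr_t)t & RTM_FLAG_MASK) == 0);	/* low bits must be free */
	assert((flags & ~RTM_FLAG_MASK) == 0);
	return (uintptr_t)t | flags;
}

static struct task *rtm_owner(uintptr_t v)
{
	return (struct task *)(v & ~(uintptr_t)RTM_FLAG_MASK);
}

static bool rtm_owner_pending(uintptr_t v)
{
	return (v & RTM_OWNER_PENDING) != 0;
}

static bool rtm_has_waiters(uintptr_t v)
{
	return (v & RTM_HAS_WAITERS) != 0;
}

/* Per the table, a NULL owner with bit 0 set is an invalid state. */
static bool rtm_state_valid(uintptr_t v)
{
	return rtm_owner(v) != NULL || !rtm_owner_pending(v);
}
```

Packing the flags into the pointer is what lets a single cmpxchg on the owner
field observe the owner and both state bits at once.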
62 | |||
63 | Pending-ownership handling is a performance optimization: | ||
64 | pending-ownership is assigned to the first (highest priority) waiter of | ||
65 | the mutex, when the mutex is released. The thread is woken up and once | ||
66 | it starts executing it can acquire the mutex. Until the mutex is taken | ||
67 | by it (bit 0 is cleared) a competing higher priority thread can "steal" | ||
68 | the mutex which puts the woken up thread back on the waiters list. | ||
69 | |||
70 | The pending-ownership optimization is especially important for the | ||
71 | uninterrupted workflow of high-prio tasks which repeatedly | ||
72 | take/release locks that have lower-prio waiters. Without this | ||
73 | optimization the higher-prio thread would ping-pong to the lower-prio | ||
74 | task [because at unlock time we always assign a new owner]. | ||
75 | |||
76 | (*) The "mutex has waiters" bit gets set to take the lock. If the lock | ||
77 | doesn't already have an owner, this bit is quickly cleared if there are | ||
78 | no waiters. So this is a transitional state to synchronize with looking | ||
79 | at the owner field of the mutex and the mutex owner releasing the lock. | ||
diff --git a/Documentation/locking/spinlocks.txt b/Documentation/locking/spinlocks.txt
new file mode 100644
index 000000000000..ff35e40bdf5b
--- /dev/null
+++ b/Documentation/locking/spinlocks.txt
@@ -0,0 +1,167 @@ | |||
1 | Lesson 1: Spin locks | ||
2 | |||
3 | The most basic primitive for locking is the spinlock. | ||
4 | |||
5 | static DEFINE_SPINLOCK(xxx_lock); | ||
6 | |||
7 | unsigned long flags; | ||
8 | |||
9 | spin_lock_irqsave(&xxx_lock, flags); | ||
10 | ... critical section here .. | ||
11 | spin_unlock_irqrestore(&xxx_lock, flags); | ||
12 | |||
13 | The above is always safe. It will disable interrupts _locally_, but the | ||
14 | spinlock itself will guarantee the global lock, so it will guarantee that | ||
15 | there is only one thread-of-control within the region(s) protected by that | ||
16 | lock. This works well even under UP, so the code does _not_ need to | ||
17 | worry about UP vs SMP issues: the spinlocks work correctly under both. | ||
18 | |||
19 | NOTE! Implications of spin_locks for memory are further described in: | ||
20 | |||
21 | Documentation/memory-barriers.txt | ||
22 | (5) LOCK operations. | ||
23 | (6) UNLOCK operations. | ||
24 | |||
25 | The above is usually pretty simple (you usually need and want only one | ||
26 | spinlock for most things - using more than one spinlock can make things a | ||
27 | lot more complex and even slower and is usually worth it only for | ||
28 | sequences that you _know_ need to be split up: avoid it at all cost if you | ||
29 | aren't sure). | ||
30 | |||
31 | This is really the only really hard part about spinlocks: once you start | ||
32 | using spinlocks they tend to expand to areas you might not have noticed | ||
33 | before, because you have to make sure the spinlocks correctly protect the | ||
34 | shared data structures _everywhere_ they are used. The spinlocks are most | ||
35 | easily added to places that are completely independent of other code (for | ||
36 | example, internal driver data structures that nobody else ever touches). | ||
37 | |||
38 | NOTE! The spin-lock is safe only when you _also_ use the lock itself | ||
39 | to do locking across CPU's, which implies that EVERYTHING that | ||
40 | touches a shared variable has to agree about the spinlock they want | ||
41 | to use. | ||
42 | |||
43 | ---- | ||
44 | |||
45 | Lesson 2: reader-writer spinlocks. | ||
46 | |||
47 | If your data accesses have a very natural pattern where you usually tend | ||
48 | to mostly read from the shared variables, the reader-writer locks | ||
49 | (rw_lock) versions of the spinlocks are sometimes useful. They allow multiple | ||
50 | readers to be in the same critical region at once, but if somebody wants | ||
51 | to change the variables it has to get an exclusive write lock. | ||
52 | |||
53 | NOTE! reader-writer locks require more atomic memory operations than | ||
54 | simple spinlocks. Unless the reader critical section is long, you | ||
55 | are better off just using spinlocks. | ||
56 | |||
57 | The routines look the same as above: | ||
58 | |||
59 | rwlock_t xxx_lock = __RW_LOCK_UNLOCKED(xxx_lock); | ||
60 | |||
61 | unsigned long flags; | ||
62 | |||
63 | read_lock_irqsave(&xxx_lock, flags); | ||
64 | .. critical section that only reads the info ... | ||
65 | read_unlock_irqrestore(&xxx_lock, flags); | ||
66 | |||
67 | write_lock_irqsave(&xxx_lock, flags); | ||
68 | .. read and write exclusive access to the info ... | ||
69 | write_unlock_irqrestore(&xxx_lock, flags); | ||
70 | |||
71 | The above kind of lock may be useful for complex data structures like | ||
72 | linked lists, especially searching for entries without changing the list | ||
73 | itself. The read lock allows many concurrent readers. Anything that | ||
74 | _changes_ the list will have to get the write lock. | ||
75 | |||
76 | NOTE! RCU is better for list traversal, but requires careful | ||
77 | attention to design detail (see Documentation/RCU/listRCU.txt). | ||
78 | |||
79 | Also, you cannot "upgrade" a read-lock to a write-lock, so if you at _any_ | ||
80 | time need to do any changes (even if you don't do it every time), you have | ||
81 | to get the write-lock at the very beginning. | ||
82 | |||
83 | NOTE! We are working hard to remove reader-writer spinlocks in most | ||
84 | cases, so please don't add a new one without consensus. (Instead, see | ||
85 | Documentation/RCU/rcu.txt for complete information.) | ||
86 | |||
87 | ---- | ||
88 | |||
89 | Lesson 3: spinlocks revisited. | ||
90 | |||
91 | The single spin-lock primitives above are by no means the only ones. They | ||
92 | are the most safe ones, and the ones that work under all circumstances, | ||
93 | but partly _because_ they are safe they are also fairly slow. They are slower | ||
94 | than they'd need to be, because they do have to disable interrupts | ||
95 | (which is just a single instruction on a x86, but it's an expensive one - | ||
96 | and on other architectures it can be worse). | ||
97 | |||
98 | If you have a case where you have to protect a data structure across | ||
99 | several CPU's and you want to use spinlocks you can potentially use | ||
100 | cheaper versions of the spinlocks. IFF you know that the spinlocks are | ||
101 | never used in interrupt handlers, you can use the non-irq versions: | ||
102 | |||
103 | spin_lock(&lock); | ||
104 | ... | ||
105 | spin_unlock(&lock); | ||
106 | |||
107 | (and the equivalent read-write versions too, of course). The spinlock will | ||
108 | guarantee the same kind of exclusive access, and it will be much faster. | ||
109 | This is useful if you know that the data in question is only ever | ||
110 | manipulated from a "process context", ie no interrupts involved. | ||
111 | |||
112 | The reason you mustn't use these versions if you have interrupts that | ||
113 | play with the spinlock is that you can get deadlocks: | ||
114 | |||
115 | spin_lock(&lock); | ||
116 | ... | ||
117 | <- interrupt comes in: | ||
118 | spin_lock(&lock); | ||
119 | |||
120 | where an interrupt tries to lock an already locked variable. This is ok if | ||
121 | the other interrupt happens on another CPU, but it is _not_ ok if the | ||
122 | interrupt happens on the same CPU that already holds the lock, because the | ||
123 | lock will obviously never be released (because the interrupt is waiting | ||
124 | for the lock, and the lock-holder is interrupted by the interrupt and will | ||
125 | not continue until the interrupt has been processed). | ||
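The deadlock can be made concrete with a userspace analogue. The sketch
below is not the kernel spinlock: struct spin_sketch and its helpers are
hypothetical names built on the C11 atomic_flag, and there are no interrupts
involved. It only shows that a second acquisition attempt from the same
thread of control cannot succeed until the holder releases the lock, which
is exactly what an interrupt handler on the lock-holding CPU would be
waiting for.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical test-and-set spinlock for illustration only. */
struct spin_sketch {
	atomic_flag locked;
};

#define SPIN_SKETCH_INIT { ATOMIC_FLAG_INIT }

/* Returns true if the lock was free and is now held by the caller. */
static bool spin_sketch_trylock(struct spin_sketch *l)
{
	return !atomic_flag_test_and_set_explicit(&l->locked,
						   memory_order_acquire);
}

/* Spins until the lock is released by its current holder. */
static void spin_sketch_lock(struct spin_sketch *l)
{
	assert(l != NULL);
	while (!spin_sketch_trylock(l))
		;	/* busy-wait */
}

static void spin_sketch_unlock(struct spin_sketch *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Calling spin_sketch_lock() twice in a row from one thread would spin forever,
since the only code that could call spin_sketch_unlock() is the code that is
busy spinning.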
126 | |||
127 | (This is also the reason why the irq-versions of the spinlocks only need | ||
128 | to disable the _local_ interrupts - it's ok to use spinlocks in interrupts | ||
129 | on other CPU's, because an interrupt on another CPU doesn't interrupt the | ||
130 | CPU that holds the lock, so the lock-holder can continue and eventually | ||
131 | releases the lock). | ||
132 | |||
133 | Note that you can be clever with read-write locks and interrupts. For | ||
134 | example, if you know that the interrupt only ever gets a read-lock, then | ||
135 | you can use a non-irq version of read locks everywhere - because they | ||
136 | don't block on each other (and thus there is no dead-lock wrt interrupts). | ||
137 | But when you do the write-lock, you have to use the irq-safe version. | ||
138 | |||
139 | For an example of being clever with rw-locks, see the "waitqueue_lock" | ||
140 | handling in kernel/sched/core.c - nothing ever _changes_ a wait-queue from | ||
141 | within an interrupt, they only read the queue in order to know whom to | ||
142 | wake up. So read-locks are safe (which is good: they are very common | ||
143 | indeed), while write-locks need to protect themselves against interrupts. | ||
144 | |||
145 | Linus | ||
146 | |||
147 | ---- | ||
148 | |||
149 | Reference information: | ||
150 | |||
151 | For dynamic initialization, use spin_lock_init() or rwlock_init() as | ||
152 | appropriate: | ||
153 | |||
154 | spinlock_t xxx_lock; | ||
155 | rwlock_t xxx_rw_lock; | ||
156 | |||
157 | static int __init xxx_init(void) | ||
158 | { | ||
159 | spin_lock_init(&xxx_lock); | ||
160 | rwlock_init(&xxx_rw_lock); | ||
161 | ... | ||
162 | } | ||
163 | |||
164 | module_init(xxx_init); | ||
165 | |||
166 | For static initialization, use DEFINE_SPINLOCK() / DEFINE_RWLOCK() or | ||
167 | __SPIN_LOCK_UNLOCKED() / __RW_LOCK_UNLOCKED() as appropriate. | ||
diff --git a/Documentation/locking/ww-mutex-design.txt b/Documentation/locking/ww-mutex-design.txt
new file mode 100644
index 000000000000..8a112dc304c3
--- /dev/null
+++ b/Documentation/locking/ww-mutex-design.txt
@@ -0,0 +1,344 @@ | |||
1 | Wait/Wound Deadlock-Proof Mutex Design | ||
2 | ====================================== | ||
3 | |||
4 | Please read mutex-design.txt first, as it applies to wait/wound mutexes too. | ||
5 | |||
6 | Motivation for WW-Mutexes | ||
7 | ------------------------- | ||
8 | |||
9 | GPU's do operations that commonly involve many buffers. Those buffers | ||
10 | can be shared across contexts/processes, exist in different memory | ||
11 | domains (for example VRAM vs system memory), and so on. And with | ||
12 | PRIME / dmabuf, they can even be shared across devices. So there are | ||
13 | a handful of situations where the driver needs to wait for buffers to | ||
14 | become ready. If you think about this in terms of waiting on a buffer | ||
15 | mutex for it to become available, this presents a problem because | ||
16 | there is no way to guarantee that buffers appear in an execbuf/batch in | ||
17 | the same order in all contexts. That is directly under control of | ||
18 | userspace, and a result of the sequence of GL calls that an application | ||
19 | makes. Which results in the potential for deadlock. The problem gets | ||
20 | more complex when you consider that the kernel may need to migrate the | ||
21 | buffer(s) into VRAM before the GPU operates on the buffer(s), which | ||
22 | may in turn require evicting some other buffers (and you don't want to | ||
23 | evict other buffers which are already queued up to the GPU), but for a | ||
24 | simplified understanding of the problem you can ignore this. | ||
25 | |||
26 | The algorithm that the TTM graphics subsystem came up with for dealing with | ||
27 | this problem is quite simple. For each group of buffers (execbuf) that need | ||
28 | to be locked, the caller would be assigned a unique reservation id/ticket, | ||
29 | from a global counter. In case of deadlock while locking all the buffers | ||
30 | associated with an execbuf, the one with the lowest reservation ticket (i.e. | ||
31 | the oldest task) wins, and the one with the higher reservation id (i.e. the | ||
32 | younger task) unlocks all of the buffers that it has already locked, and then | ||
33 | tries again. | ||
34 | |||
35 | In the RDBMS literature this deadlock handling approach is called wait/wound: | ||
36 | The older task waits until it can acquire the contended lock. The younger task | ||
37 | needs to back off and drop all the locks it is currently holding, i.e. the | ||
38 | younger task is wounded. | ||
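The ticket rule can be sketched in a few lines. This is a simplified,
single-threaded model and not the ww_mutex implementation: struct
ww_ctx_sketch, ww_ctx_sketch_init and ww_must_back_off are hypothetical
names, the global counter is a plain variable rather than an atomic, and
counter wrap-around is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical acquire context holding only the reservation ticket. */
struct ww_ctx_sketch {
	uint64_t stamp;
};

/* Global ticket counter; not thread-safe in this sketch. */
static uint64_t ww_next_stamp;

/* Hand out a unique ticket; earlier callers get lower (older) tickets. */
static void ww_ctx_sketch_init(struct ww_ctx_sketch *ctx)
{
	ctx->stamp = ++ww_next_stamp;
}

/*
 * Returns true if ctx is the younger of the two contexts and therefore
 * has to drop its locks and retry, false if it may simply wait.
 */
static bool ww_must_back_off(const struct ww_ctx_sketch *ctx,
			     const struct ww_ctx_sketch *holder)
{
	assert(ctx != holder);	/* a context never contends with itself */
	return ctx->stamp > holder->stamp;
}
```

Since the oldest ticket never backs off, at least one task always makes
progress, and a task that keeps its ticket across retries eventually becomes
the oldest.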
39 | |||
40 | Concepts | ||
41 | -------- | ||
42 | |||
43 | Compared to normal mutexes two additional concepts/objects show up in the lock | ||
44 | interface for w/w mutexes: | ||
45 | |||
46 | Acquire context: To ensure eventual forward progress it is important that a task | ||
47 | trying to acquire locks doesn't grab a new reservation id, but keeps the one it | ||
48 | acquired when starting the lock acquisition. This ticket is stored in the | ||
49 | acquire context. Furthermore the acquire context keeps track of debugging state | ||
50 | to catch w/w mutex interface abuse. | ||
51 | |||
52 | W/w class: In contrast to normal mutexes the lock class needs to be explicit for | ||
53 | w/w mutexes, since it is required to initialize the acquire context. | ||
54 | |||
55 | Furthermore there are three different classes of w/w lock acquire functions: | ||
56 | |||
57 | * Normal lock acquisition with a context, using ww_mutex_lock. | ||
58 | |||
59 | * Slowpath lock acquisition on the contending lock, used by the wounded task | ||
60 | after having dropped all already acquired locks. These functions have the | ||
61 | _slow postfix. | ||
62 | |||
63 | From a simple semantics point-of-view the _slow functions are not strictly | ||
64 | required, since simply calling the normal ww_mutex_lock functions on the | ||
65 | contending lock (after having dropped all other already acquired locks) will | ||
66 | work correctly. After all if no other ww mutex has been acquired yet there's | ||
67 | no deadlock potential and hence the ww_mutex_lock call will block and not | ||
68 | prematurely return -EDEADLK. The advantage of the _slow functions is in | ||
69 | interface safety: | ||
70 | - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow | ||
71 | has a void return type. Note that since ww mutex code needs loops/retries | ||
72 | anyway the __must_check doesn't result in spurious warnings, even though the | ||
73 | very first lock operation can never fail. | ||
74 | - When full debugging is enabled ww_mutex_lock_slow checks that all acquired | ||
75 | ww mutexes have been released (preventing deadlocks) and makes sure that we | ||
76 | block on the contending lock (preventing spinning through the -EDEADLK | ||
77 | slowpath until the contended lock can be acquired). | ||
78 | |||
79 | * Functions to only acquire a single w/w mutex, which results in the exact same | ||
80 | semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL | ||
81 | context. | ||
82 | |||
83 | Again this is not strictly required. But often you only want to acquire a | ||
84 | single lock in which case it's pointless to set up an acquire context (and so | ||
85 | better to avoid grabbing a deadlock avoidance ticket). | ||
86 | |||
87 | Of course, all the usual variants for handling wake-ups due to signals are also | ||
88 | provided. | ||
89 | |||
90 | Usage | ||
91 | ----- | ||
92 | |||
93 | Three different ways to acquire locks within the same w/w class. Common | ||
94 | definitions for methods #1 and #2: | ||
95 | |||
96 | static DEFINE_WW_CLASS(ww_class); | ||
97 | |||
98 | struct obj { | ||
99 | struct ww_mutex lock; | ||
100 | /* obj data */ | ||
101 | }; | ||
102 | |||
103 | struct obj_entry { | ||
104 | struct list_head head; | ||
105 | struct obj *obj; | ||
106 | }; | ||
107 | |||
108 | Method 1, using a list in execbuf->buffers that's not allowed to be reordered. | ||
109 | This is useful if a list of required objects is already tracked somewhere. | ||
110 | Furthermore the lock helper can propagate the -EALREADY return code back to | ||
111 | the caller as a signal that an object is twice on the list. This is useful if | ||
112 | the list is constructed from userspace input and the ABI requires userspace to | ||
113 | not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl). | ||
114 | |||
115 | int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) | ||
116 | { | ||
117 | struct obj *res_obj = NULL; | ||
118 | struct obj_entry *contended_entry = NULL, *entry; | ||
119 | int ret; | ||
120 | |||
121 | ww_acquire_init(ctx, &ww_class); | ||
122 | |||
123 | retry: | ||
124 | list_for_each_entry (entry, list, head) { | ||
125 | if (entry->obj == res_obj) { | ||
126 | res_obj = NULL; | ||
127 | continue; | ||
128 | } | ||
129 | ret = ww_mutex_lock(&entry->obj->lock, ctx); | ||
130 | if (ret < 0) { | ||
131 | contended_entry = entry; | ||
132 | goto err; | ||
133 | } | ||
134 | } | ||
135 | |||
136 | ww_acquire_done(ctx); | ||
137 | return 0; | ||
138 | |||
139 | err: | ||
140 | list_for_each_entry_continue_reverse (entry, list, head) | ||
141 | ww_mutex_unlock(&entry->obj->lock); | ||
142 | |||
143 | if (res_obj) | ||
144 | ww_mutex_unlock(&res_obj->lock); | ||
145 | |||
146 | if (ret == -EDEADLK) { | ||
147 | /* we lost out in a seqno race, lock and retry.. */ | ||
148 | ww_mutex_lock_slow(&contended_entry->obj->lock, ctx); | ||
149 | res_obj = contended_entry->obj; | ||
150 | goto retry; | ||
151 | } | ||
152 | ww_acquire_fini(ctx); | ||
153 | |||
154 | return ret; | ||
155 | } | ||
156 | |||
157 | Method 2, using a list in execbuf->buffers that can be reordered. Duplicate | ||
158 | entries are detected via -EALREADY just as in method 1 above, but reordering | ||
159 | the list allows for somewhat more idiomatic code. | ||
160 | |||
161 | int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) | ||
162 | { | ||
163 | 	struct obj_entry *entry, *entry2; | ||
164 | 	int ret; | ||
165 | |||
166 | 	ww_acquire_init(ctx, &ww_class); | ||
167 | |||
168 | 	list_for_each_entry(entry, list, head) { | ||
169 | 		ret = ww_mutex_lock(&entry->obj->lock, ctx); | ||
170 | 		if (ret < 0) { | ||
171 | 			entry2 = entry; | ||
172 | 			list_for_each_entry_continue_reverse(entry2, list, head) | ||
173 | 				ww_mutex_unlock(&entry2->obj->lock); | ||
174 | |||
175 | 			if (ret != -EDEADLK) { | ||
176 | 				ww_acquire_fini(ctx); | ||
177 | 				return ret; | ||
178 | 			} | ||
179 | |||
180 | 			/* we lost out in a seqno race, lock and retry.. */ | ||
181 | 			ww_mutex_lock_slow(&entry->obj->lock, ctx); | ||
182 | |||
183 | 			/* | ||
184 | 			 * Move entry to the head of the list; this makes | ||
185 | 			 * entry->next point to the first unlocked entry, | ||
186 | 			 * restarting the for loop. | ||
187 | 			 */ | ||
188 | 			list_del(&entry->head); | ||
189 | 			list_add(&entry->head, list); | ||
190 | 		} | ||
191 | 	} | ||
192 | |||
193 | 	ww_acquire_done(ctx); | ||
194 | 	return 0; | ||
195 | } | ||
196 | |||
197 | Unlocking works the same way for both methods #1 and #2: | ||
198 | |||
199 | void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) | ||
200 | { | ||
201 | 	struct obj_entry *entry; | ||
202 | |||
203 | 	list_for_each_entry(entry, list, head) | ||
204 | 		ww_mutex_unlock(&entry->obj->lock); | ||
205 | |||
206 | 	ww_acquire_fini(ctx); | ||
207 | } | ||
208 | |||
209 | Method 3 is useful if the list of objects is constructed ad-hoc and not upfront, | ||
210 | e.g. when adjusting edges in a graph where each node has its own ww_mutex lock, | ||
211 | and edges can only be changed when holding the locks of all involved nodes. w/w | ||
212 | mutexes are a natural fit for such a case for two reasons: | ||
213 | - They can handle lock-acquisition in any order which allows us to start walking | ||
214 | a graph from a starting point and then iteratively discovering new edges and | ||
215 | locking down the nodes those edges connect to. | ||
216 | - Due to the -EALREADY return code signalling that a given object is already | ||
217 | held there's no need for additional book-keeping to break cycles in the graph | ||
218 | or keep track of which locks are already held (when using more than one node | ||
219 | as a starting point). | ||
220 | |||
221 | Note that this approach differs in two important ways from the above methods: | ||
222 | - Since the list of objects is dynamically constructed (and might very well be | ||
223 | different when retrying due to hitting the -EDEADLK wound condition) there's | ||
224 | no need to keep any object on a persistent list when it's not locked. We can | ||
225 | therefore move the list_head into the object itself. | ||
226 | - On the other hand the dynamic object list construction also means that the -EALREADY return | ||
227 | code can't be propagated. | ||
228 | |||
229 | Note also that methods #1 and #2 can be combined with method #3, e.g. to first | ||
230 | lock a list of starting nodes (passed in from userspace) using one of the above | ||
231 | methods, and then lock any additional objects affected by the operation using | ||
232 | method #3 below. The backoff/retry procedure will be a bit more involved, since | ||
233 | when the dynamic locking step hits -EDEADLK we also need to unlock all the | ||
234 | objects acquired with the fixed list. But the w/w mutex debug checks will catch | ||
235 | any interface misuse for these cases. | ||
236 | |||
237 | Also, method 3 can't fail the lock acquisition step since it doesn't return | ||
238 | -EALREADY. Of course this would be different when using the _interruptible | ||
239 | variants, but that's outside of the scope of these examples here. | ||
240 | |||
241 | struct obj { | ||
242 | 	struct ww_mutex ww_mutex; | ||
243 | 	struct list_head locked_list; | ||
244 | }; | ||
245 | |||
246 | static DEFINE_WW_CLASS(ww_class); | ||
247 | |||
248 | void __unlock_objs(struct list_head *list) | ||
249 | { | ||
250 | 	struct obj *entry, *temp; | ||
251 | |||
252 | 	list_for_each_entry_safe(entry, temp, list, locked_list) { | ||
253 | 		/* Unlink before unlocking: only the current lock holder | ||
254 | 		 * is allowed to use the object. */ | ||
255 | 		list_del(&entry->locked_list); | ||
256 | 		ww_mutex_unlock(&entry->ww_mutex); | ||
257 | 	} | ||
258 | } | ||
259 | |||
260 | void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) | ||
261 | { | ||
262 | 	struct obj *obj; | ||
263 | 	int ret; | ||
264 | |||
265 | 	ww_acquire_init(ctx, &ww_class); | ||
266 | |||
267 | retry: | ||
268 | 	/* re-init loop start state */ | ||
269 | 	loop { | ||
270 | 		/* pseudocode: walk the graph and pick the next obj | ||
271 | 		 * to lock */ | ||
272 | |||
273 | 		ret = ww_mutex_lock(&obj->ww_mutex, ctx); | ||
274 | 		if (ret == -EALREADY) { | ||
275 | 			/* we have that one already, get to the next object */ | ||
276 | 			continue; | ||
277 | 		} | ||
278 | 		if (ret == -EDEADLK) { | ||
279 | 			__unlock_objs(list); | ||
280 | |||
281 | 			ww_mutex_lock_slow(&obj->ww_mutex, ctx); | ||
282 | 			list_add(&obj->locked_list, list); | ||
283 | 			goto retry; | ||
284 | 		} | ||
285 | |||
286 | 		/* locked a new object, add it to the list */ | ||
287 | 		list_add_tail(&obj->locked_list, list); | ||
288 | 	} | ||
289 | |||
290 | 	ww_acquire_done(ctx); | ||
291 | } | ||
292 | |||
293 | void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) | ||
294 | { | ||
295 | 	__unlock_objs(list); | ||
296 | 	ww_acquire_fini(ctx); | ||
297 | } | ||
298 | |||
299 | Method 4: Only lock one single object. In that case deadlock detection and | ||
300 | prevention is obviously overkill, since with grabbing just one lock you can't | ||
301 | produce a deadlock within just one class. To simplify this case the w/w mutex | ||
302 | api can be used with a NULL context. | ||
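
A minimal sketch of method 4, assuming an object laid out like the struct obj
used for methods #1 and #2. The ww_mutex type and the two lock functions below
are simplified userspace stand-ins, not the kernel implementation; they exist
only so that the fragment builds outside the kernel, and in-kernel code would
include <linux/ww_mutex.h> instead. The data field and obj_set_data() are
hypothetical names used for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace stand-ins (assumptions) so the sketch builds outside the
 * kernel; real code gets these from <linux/ww_mutex.h>.
 */
struct ww_acquire_ctx;
struct ww_mutex { int locked; };

static int ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
	assert(ctx == NULL);	/* only the NULL-context path is modelled */
	assert(!lock->locked);	/* a real mutex would sleep here instead */
	lock->locked = 1;
	return 0;
}

static void ww_mutex_unlock(struct ww_mutex *lock)
{
	assert(lock->locked);
	lock->locked = 0;
}

struct obj {
	struct ww_mutex lock;
	int data;		/* hypothetical obj data */
};

/* Method 4: lock a single object without an acquire context. */
static int obj_set_data(struct obj *obj, int value)
{
	int ret;

	/*
	 * With a NULL context no ticket is taken, so there is no
	 * -EDEADLK backoff path and no -EALREADY case to handle.
	 */
	ret = ww_mutex_lock(&obj->lock, NULL);
	if (ret)
		return ret;

	obj->data = value;
	ww_mutex_unlock(&obj->lock);
	return 0;
}
```

Since no acquire context is involved, none of the ww_acquire_init,
ww_acquire_done or ww_acquire_fini calls from the other methods are needed
here.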
303 | |||
304 | Implementation Details | ||
305 | ---------------------- | ||
306 | |||
307 | Design: | ||
308 | ww_mutex currently encapsulates a struct mutex; this means no extra overhead for | ||
309 | normal mutex locks, which are far more common. As such there is only a small | ||
310 | increase in code size if wait/wound mutexes are not used. | ||
311 | |||
312 | In general, not much contention is expected. The locks are typically used to | ||
313 | serialize access to resources for devices. The only way to make wakeups | ||
314 | smarter would be at the cost of adding a field to struct mutex_waiter. This | ||
315 | would add overhead to all cases where normal mutexes are used, and | ||
316 | ww_mutexes are generally less performance sensitive. | ||
317 | |||
318 | Lockdep: | ||
319 | Special care has been taken to warn for as many cases of api abuse | ||
320 | as possible. Some common api abuses will be caught with | ||
321 | CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended. | ||
322 | |||
323 | Some of the errors which will be warned about: | ||
324 | - Forgetting to call ww_acquire_fini or ww_acquire_init. | ||
325 | - Attempting to lock more mutexes after ww_acquire_done. | ||
326 | - Attempting to lock the wrong mutex after -EDEADLK and | ||
327 | unlocking all mutexes. | ||
328 | - Attempting to lock the right mutex after -EDEADLK, | ||
329 | before unlocking all mutexes. | ||
330 | |||
331 | - Calling ww_mutex_lock_slow before -EDEADLK was returned. | ||
332 | |||
333 | - Unlocking mutexes with the wrong unlock function. | ||
334 | - Calling one of the ww_acquire_* twice on the same context. | ||
335 | - Using a different ww_class for the mutex than for the ww_acquire_ctx. | ||
336 | - Normal lockdep errors that can result in deadlocks. | ||
337 | |||
338 | Some of the lockdep errors that can result in deadlocks: | ||
339 | - Calling ww_acquire_init to initialize a second ww_acquire_ctx before | ||
340 | having called ww_acquire_fini on the first. | ||
341 | - 'normal' deadlocks that can occur. | ||
342 | |||
343 | FIXME: Update this section once we have the TASK_DEADLOCK task state flag magic | ||
344 | implemented. | ||