Diffstat (limited to 'Documentation/rt-mutex-design.txt')
-rw-r--r-- | Documentation/rt-mutex-design.txt | 781 |
1 files changed, 781 insertions, 0 deletions
diff --git a/Documentation/rt-mutex-design.txt b/Documentation/rt-mutex-design.txt
new file mode 100644
index 000000000000..c472ffacc2f6
--- /dev/null
+++ b/Documentation/rt-mutex-design.txt
@@ -0,0 +1,781 @@
1 | # | ||
2 | # Copyright (c) 2006 Steven Rostedt | ||
3 | # Licensed under the GNU Free Documentation License, Version 1.2 | ||
4 | # | ||
5 | |||
6 | RT-mutex implementation design | ||
7 | ------------------------------ | ||
8 | |||
9 | This document tries to describe the design of the rtmutex.c implementation. | ||
10 | It doesn't describe the reasons why rtmutex.c exists. For that please see | ||
11 | Documentation/rt-mutex.txt. This document does explain problems that | ||
12 | happen without this code, but only to give the context needed to | ||
13 | understand what the code is actually doing. | ||
14 | |||
15 | The goal of this document is to help others understand the priority | ||
16 | inheritance (PI) algorithm that is used, as well as reasons for the | ||
17 | decisions that were made to implement PI in the manner that was done. | ||
18 | |||
19 | |||
20 | Unbounded Priority Inversion | ||
21 | ---------------------------- | ||
22 | |||
23 | Priority inversion is when a lower priority process executes while a higher | ||
24 | priority process wants to run. This happens for several reasons, and | ||
25 | most of the time it can't be helped. Anytime a high priority process wants | ||
26 | to use a resource that a lower priority process has (a mutex for example), | ||
27 | the high priority process must wait until the lower priority process is done | ||
28 | with the resource. This is a priority inversion. What we want to prevent | ||
29 | is something called unbounded priority inversion. That is when the high | ||
30 | priority process is prevented from running by a lower priority process for | ||
31 | an undetermined amount of time. | ||
32 | |||
33 | The classic example of unbounded priority inversion is where you have three | ||
34 | processes, let's call them processes A, B, and C, where A is the highest | ||
35 | priority process, C is the lowest, and B is in between. A tries to grab a lock | ||
36 | that C owns and must wait and lets C run to release the lock. But in the | ||
37 | meantime, B executes, and since B is of a higher priority than C, it preempts C, | ||
38 | but by doing so, it is in fact preempting A which is a higher priority process. | ||
39 | Now there's no way of knowing how long A will be sleeping waiting for C | ||
40 | to release the lock, because for all we know, B is a CPU hog and will | ||
41 | never give C a chance to release the lock. This is called unbounded priority | ||
42 | inversion. | ||
43 | |||
44 | Here's a little ASCII art to show the problem. | ||
45 | |||
46 | grab lock L1 (owned by C) | ||
47 | | | ||
48 | A ---+ | ||
49 | C preempted by B | ||
50 | | | ||
51 | C +----+ | ||
52 | |||
53 | B +--------> | ||
54 | B now keeps A from running. | ||
55 | |||
56 | |||
57 | Priority Inheritance (PI) | ||
58 | ------------------------- | ||
59 | |||
60 | There are several ways to solve this issue, but other ways are out of scope | ||
61 | for this document. Here we only discuss PI. | ||
62 | |||
63 | PI is where a process inherits the priority of another process if the other | ||
64 | process blocks on a lock owned by the current process. To make this easier | ||
65 | to understand, let's use the previous example, with processes A, B, and C again. | ||
66 | |||
67 | This time, when A blocks on the lock owned by C, C would inherit the priority | ||
68 | of A. So now if B becomes runnable, it would not preempt C, since C now has | ||
69 | the high priority of A. As soon as C releases the lock, it loses its | ||
70 | inherited priority, and A then can continue with the resource that C had. | ||
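
The A, B, C scenario above can be sketched as a small toy model. This is
only an illustration and not the rtmutex.c code: the toy_* names are
hypothetical, there is a single lock, and a lower prio number is assumed
to mean a higher priority (the convention the kernel's prio field uses).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model, not kernel code. Lower prio value = higher priority. */
struct toy_task {
	const char *name;
	int normal_prio;	/* priority the task was given */
	int prio;		/* effective priority, possibly inherited */
	int runnable;		/* nonzero if the task can run right now */
};

/* The waiter blocks on a lock held by owner; owner inherits if waiter is higher. */
static void toy_block_on(struct toy_task *waiter, struct toy_task *owner)
{
	assert(waiter != owner);
	waiter->runnable = 0;
	if (waiter->prio < owner->prio)
		owner->prio = waiter->prio;
}

/* The owner releases the lock: it loses any inherited priority. */
static void toy_release(struct toy_task *owner, struct toy_task *waiter)
{
	owner->prio = owner->normal_prio;
	waiter->runnable = 1;
}

/* Pick the runnable task with the best (lowest) prio value. */
static struct toy_task *toy_pick_next(struct toy_task **tasks, size_t n)
{
	struct toy_task *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!tasks[i]->runnable)
			continue;
		if (best == NULL || tasks[i]->prio < best->prio)
			best = tasks[i];
	}
	return best;
}
```

With A at prio 1, B at prio 5 and C at prio 9, calling toy_block_on(A, C)
makes toy_pick_next choose C over B, and after toy_release(C, A) it
chooses A again.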
71 | |||
72 | Terminology | ||
73 | ----------- | ||
74 | |||
75 | Here I explain some terminology that is used in this document to help describe | ||
76 | the design that is used to implement PI. | ||
77 | |||
78 | PI chain - The PI chain is an ordered series of locks and processes that cause | ||
79 | processes to inherit priorities from a previous process that is | ||
80 | blocked on one of its locks. This is described in more detail | ||
81 | later in this document. | ||
82 | |||
83 | mutex - In this document, to differentiate from locks that implement | ||
84 | PI and spin locks that are used in the PI code, from now on | ||
85 | the PI locks will be called a mutex. | ||
86 | |||
87 | lock - In this document from now on, I will use the term lock when | ||
88 | referring to spin locks that are used to protect parts of the PI | ||
89 | algorithm. These locks disable preemption for UP (when | ||
90 | CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from | ||
91 | entering critical sections simultaneously. | ||
92 | |||
93 | spin lock - Same as lock above. | ||
94 | |||
95 | waiter - A waiter is a struct that is stored on the stack of a blocked | ||
96 | process. Since the scope of the waiter is within the code for | ||
97 | a process being blocked on the mutex, it is fine to allocate | ||
98 | the waiter on the process's stack (local variable). This | ||
99 | structure holds a pointer to the task, as well as the mutex that | ||
100 | the task is blocked on. It also has the plist node structures to | ||
101 | place the task in the wait_list of a mutex as well as the | ||
102 | pi_list of a mutex owner task (described below). | ||
103 | |||
104 | waiter is sometimes used in reference to the task that is waiting | ||
105 | on a mutex. This is the same as waiter->task. | ||
106 | |||
107 | waiters - A list of processes that are blocked on a mutex. | ||
108 | |||
109 | top waiter - The highest priority process waiting on a specific mutex. | ||
110 | |||
111 | top pi waiter - The highest priority process waiting on one of the mutexes | ||
112 | that a specific process owns. | ||
113 | |||
114 | Note: task and process are used interchangeably in this document, mostly to | ||
115 | differentiate between two processes that are being described together. | ||
116 | |||
117 | |||
118 | PI chain | ||
119 | -------- | ||
120 | |||
121 | The PI chain is a list of processes and mutexes that may cause priority | ||
122 | inheritance to take place. Multiple chains may converge, but a chain | ||
123 | would never diverge, since a process can't be blocked on more than one | ||
124 | mutex at a time. | ||
125 | |||
126 | Example: | ||
127 | |||
128 | Process: A, B, C, D, E | ||
129 | Mutexes: L1, L2, L3, L4 | ||
130 | |||
131 | A owns: L1 | ||
132 | B blocked on L1 | ||
133 | B owns L2 | ||
134 | C blocked on L2 | ||
135 | C owns L3 | ||
136 | D blocked on L3 | ||
137 | D owns L4 | ||
138 | E blocked on L4 | ||
139 | |||
140 | The chain would be: | ||
141 | |||
142 | E->L4->D->L3->C->L2->B->L1->A | ||
143 | |||
144 | To show where two chains merge, we could add another process F and | ||
145 | another mutex L5 where B owns L5 and F is blocked on mutex L5. | ||
146 | |||
147 | The chain for F would be: | ||
148 | |||
149 | F->L5->B->L1->A | ||
150 | |||
151 | Since a process may own more than one mutex, but never be blocked on more than | ||
152 | one, the chains merge. | ||
153 | |||
154 | Here we show both chains: | ||
155 | |||
156 | E->L4->D->L3->C->L2-+ | ||
157 | | | ||
158 | +->B->L1->A | ||
159 | | | ||
160 | F->L5-+ | ||
161 | |||
162 | For PI to work, the processes at the right end of these chains (or we may | ||
163 | also call it the Top of the chain) must be equal to or higher in priority | ||
164 | than the processes to the left or below in the chain. | ||
165 | |||
166 | Also since a mutex may have more than one process blocked on it, we can | ||
167 | have multiple chains merge at mutexes. If we add another process G that is | ||
168 | blocked on mutex L2: | ||
169 | |||
170 | G->L2->B->L1->A | ||
171 | |||
172 | And once again, to show how this can grow I will show the merging chains | ||
173 | again. | ||
174 | |||
175 | E->L4->D->L3->C-+ | ||
176 | +->L2-+ | ||
177 | | | | ||
178 | G-+ +->B->L1->A | ||
179 | | | ||
180 | F->L5-+ | ||
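
To make the chain structure concrete, here is a minimal sketch of how a
chain can be followed from any process to the top. The chain_* types and
function are hypothetical simplifications (the real code stores these
links in the waiter and mutex structures), and the sketch assumes there
is no deadlock, since a cycle would make the loop run forever.

```c
#include <assert.h>
#include <stddef.h>

struct chain_mutex;

/* Hypothetical simplified task: blocked on at most one mutex. */
struct chain_task {
	const char *name;
	struct chain_mutex *blocked_on;	/* NULL if not blocked */
};

/* Hypothetical simplified mutex: at most one owner. */
struct chain_mutex {
	const char *name;
	struct chain_task *owner;	/* NULL if not owned */
};

/*
 * Follow task -> mutex -> owner links until a task that is not blocked
 * is reached. Returns that task (the top of the chain) and stores the
 * number of tasks visited in *depth.
 */
static struct chain_task *chain_top(struct chain_task *task, int *depth)
{
	assert(task != NULL && depth != NULL);
	*depth = 1;
	while (task->blocked_on != NULL && task->blocked_on->owner != NULL) {
		task = task->blocked_on->owner;
		(*depth)++;
	}
	return task;
}
```

Because each task has a single blocked_on pointer, a walk can never
branch, which is the reason chains may merge but never diverge. In the
example above, walks that start at E, F or G all end at A.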
181 | |||
182 | |||
183 | Plist | ||
184 | ----- | ||
185 | |||
186 | Before I go further and talk about how the PI chain is stored through lists | ||
187 | on both mutexes and processes, I'll explain the plist. This is similar to | ||
188 | the struct list_head functionality that is already in the kernel. | ||
189 | The implementation of plist is out of scope for this document, but it is | ||
190 | very important to understand what it does. | ||
191 | |||
192 | There are a few differences between plist and list, the most important one | ||
193 | being that plist is a priority sorted linked list. This means that the | ||
194 | priorities of the plist are sorted, such that it takes O(1) to retrieve the | ||
195 | highest priority item in the list. Obviously this is useful to store processes | ||
196 | based on their priorities. | ||
197 | |||
198 | Another difference, which is important for implementation, is that, unlike | ||
199 | list, the head of the list is a different element than the nodes of a list. | ||
200 | So the head of the list is declared as struct plist_head and nodes that will | ||
201 | be added to the list are declared as struct plist_node. | ||
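
The two properties described above (priority ordering, and a head type
that differs from the node type) can be illustrated with a much simpler
structure. The sketch_* names below are hypothetical and this is not the
kernel's plist implementation; it is a singly linked list that assumes a
lower prio value means a higher priority.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a list node. Lower prio value = higher priority. */
struct sketch_node {
	int prio;
	struct sketch_node *next;
};

/* Hypothetical stand-in for the list head: a different type than the nodes. */
struct sketch_head {
	struct sketch_node *first;	/* highest priority node, or NULL */
};

/* Insert in priority order; nodes of equal priority keep FIFO order. */
static void sketch_add(struct sketch_node *node, struct sketch_head *head)
{
	struct sketch_node **link = &head->first;

	assert(node != NULL && head != NULL);
	while (*link != NULL && (*link)->prio <= node->prio)
		link = &(*link)->next;
	node->next = *link;
	*link = node;
}

/* Retrieving the highest priority item is O(1): it is always first. */
static struct sketch_node *sketch_first(const struct sketch_head *head)
{
	assert(head != NULL);
	return head->first;
}
```

Insertion in this sketch walks the list, so it is O(n); only the lookup
of the top element is constant time.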
202 | |||
203 | |||
204 | Mutex Waiter List | ||
205 | ----------------- | ||
206 | |||
207 | Every mutex keeps track of all the waiters that are blocked on itself. The mutex | ||
208 | has a plist to store these waiters by priority. This list is protected by | ||
209 | a spin lock that is located in the struct of the mutex. This lock is called | ||
210 | wait_lock. Since the modification of the waiter list is never done in | ||
211 | interrupt context, the wait_lock can be taken without disabling interrupts. | ||
212 | |||
213 | |||
214 | Task PI List | ||
215 | ------------ | ||
216 | |||
217 | To keep track of the PI chains, each process has its own PI list. This is | ||
218 | a list of all top waiters of the mutexes that are owned by the process. | ||
219 | Note that this list only holds the top waiters and not all waiters that are | ||
220 | blocked on mutexes owned by the process. | ||
221 | |||
222 | The top of the task's PI list is always the highest priority task that | ||
223 | is waiting on a mutex that is owned by the task. So if the task has | ||
224 | inherited a priority, it will always be the priority of the task that is | ||
225 | at the top of this list. | ||
226 | |||
227 | This list is stored in the task structure of a process as a plist called | ||
228 | pi_list. This list is protected by a spin lock also in the task structure, | ||
229 | called pi_lock. This lock may also be taken in interrupt context, so when | ||
230 | locking the pi_lock, interrupts must be disabled. | ||
231 | |||
232 | |||
233 | Depth of the PI Chain | ||
234 | --------------------- | ||
235 | |||
236 | The maximum depth of the PI chain is not dynamic, and could actually be | ||
237 | defined. But it is very complex to figure out, since it depends on all | ||
238 | the nesting of mutexes. Let's look at the example where we have 3 mutexes, | ||
239 | L1, L2, and L3, and four separate functions func1, func2, func3 and func4. | ||
240 | The following shows a locking order of L1->L2->L3, but may not actually | ||
241 | be directly nested that way. | ||
242 | |||
243 | void func1(void) | ||
244 | { | ||
245 | mutex_lock(L1); | ||
246 | |||
247 | /* do anything */ | ||
248 | |||
249 | mutex_unlock(L1); | ||
250 | } | ||
251 | |||
252 | void func2(void) | ||
253 | { | ||
254 | mutex_lock(L1); | ||
255 | mutex_lock(L2); | ||
256 | |||
257 | /* do something */ | ||
258 | |||
259 | mutex_unlock(L2); | ||
260 | mutex_unlock(L1); | ||
261 | } | ||
262 | |||
263 | void func3(void) | ||
264 | { | ||
265 | mutex_lock(L2); | ||
266 | mutex_lock(L3); | ||
267 | |||
268 | /* do something else */ | ||
269 | |||
270 | mutex_unlock(L3); | ||
271 | mutex_unlock(L2); | ||
272 | } | ||
273 | |||
274 | void func4(void) | ||
275 | { | ||
276 | mutex_lock(L3); | ||
277 | |||
278 | /* do something again */ | ||
279 | |||
280 | mutex_unlock(L3); | ||
281 | } | ||
282 | |||
283 | Now we add 4 processes that run each of these functions separately. | ||
284 | Processes A, B, C, and D which run functions func1, func2, func3 and func4 | ||
285 | respectively, and such that D runs first and A last. With D being preempted | ||
286 | in func4 in the "do something again" area, we have a locking that follows: | ||
287 | |||
288 | D owns L3 | ||
289 | C blocked on L3 | ||
290 | C owns L2 | ||
291 | B blocked on L2 | ||
292 | B owns L1 | ||
293 | A blocked on L1 | ||
294 | |||
295 | And thus we have the chain A->L1->B->L2->C->L3->D. | ||
296 | |||
297 | This gives us a PI depth of 4 (four processes), but looking at any of the | ||
298 | functions individually, it seems as though they only have at most a locking | ||
299 | depth of two. So, although the locking depth is defined at compile time, | ||
300 | it still is very difficult to find the possibilities of that depth. | ||
301 | |||
302 | Now since mutexes can be defined by user-land applications, we don't want a DoS | ||
303 | type of application that nests a large number of mutexes to create a large | ||
304 | PI chain, and have the code holding spin locks while looking at a large | ||
305 | amount of data. So to prevent this, the implementation not only implements | ||
306 | a maximum lock depth, but also only holds at most two different locks at a | ||
307 | time, as it walks the PI chain. More about this below. | ||
308 | |||
309 | |||
310 | Mutex owner and flags | ||
311 | --------------------- | ||
312 | |||
313 | The mutex structure contains a pointer to the owner of the mutex. If the | ||
314 | mutex is not owned, this owner is set to NULL. Since all architectures | ||
315 | have the task structure on at least a four byte alignment (and if this is | ||
316 | not true, the rtmutex.c code will be broken!), this allows for the two | ||
317 | least significant bits to be used as flags. This part is also described | ||
318 | in Documentation/rt-mutex.txt, but will also be briefly described here. | ||
319 | |||
320 | Bit 0 is used as the "Pending Owner" flag. This is described later. | ||
321 | Bit 1 is used as the "Has Waiters" flag. This is also described later | ||
322 | in more detail, but is set whenever there are waiters on a mutex. | ||
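
As an illustration of how two flag bits can share a word with an aligned
pointer, here is a small sketch. The SKETCH_* macros and sketch_*
functions are hypothetical names, not the ones used in rtmutex.c, and the
sketch assumes (and asserts) that the task pointer is at least four byte
aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PENDING_OWNER	((uintptr_t)1)	/* bit 0 */
#define SKETCH_HAS_WAITERS	((uintptr_t)2)	/* bit 1 */
#define SKETCH_FLAG_MASK	((uintptr_t)3)

struct sketch_task {
	int prio;
};

/* Combine an owner pointer and flag bits into one word. */
static uintptr_t sketch_pack(struct sketch_task *owner, uintptr_t flags)
{
	uintptr_t bits = (uintptr_t)owner;

	/* The trick only works if the two low bits of the pointer are free. */
	assert((bits & SKETCH_FLAG_MASK) == 0);
	assert((flags & ~SKETCH_FLAG_MASK) == 0);
	return bits | flags;
}

/* Mask the flag bits off to recover the owner pointer. */
static struct sketch_task *sketch_owner(uintptr_t word)
{
	return (struct sketch_task *)(word & ~SKETCH_FLAG_MASK);
}

static int sketch_is_pending(uintptr_t word)
{
	return (word & SKETCH_PENDING_OWNER) != 0;
}

static int sketch_has_waiters(uintptr_t word)
{
	return (word & SKETCH_HAS_WAITERS) != 0;
}
```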
323 | |||
324 | |||
325 | cmpxchg Tricks | ||
326 | -------------- | ||
327 | |||
328 | Some architectures implement an atomic cmpxchg (Compare and Exchange). This | ||
329 | is used (when applicable) to keep the fast path of grabbing and releasing | ||
330 | mutexes short. | ||
331 | |||
332 | cmpxchg is basically the following function performed atomically: | ||
333 | |||
334 | unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C) | ||
335 | { | ||
336 | unsigned long T = *A; | ||
337 | if (*A == *B) { | ||
338 | *A = *C; | ||
339 | } | ||
340 | return T; | ||
341 | } | ||
342 | #define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c) | ||
343 | |||
344 | This is really nice to have, since it allows you to only update a variable | ||
345 | if the variable is what you expect it to be. You know if it succeeded if | ||
346 | the return value (the old value of A) is equal to B. | ||
347 | |||
348 | The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If | ||
349 | the architecture does not support CMPXCHG, then this macro is simply set | ||
350 | to fail every time. But if CMPXCHG is supported, then this will | ||
351 | help out extremely to keep the fast path short. | ||
352 | |||
353 | The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize | ||
354 | the system for architectures that support it. This will also be explained | ||
355 | later in this document. | ||
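
For readers who want to experiment in user space, the fast path idea can
be approximated with C11 atomics. This is a sketch under assumptions: the
sketch_* functions are hypothetical, the owner is modelled as a plain
integer word where 0 means "no owner", and it is not the kernel's
rt_mutex_cmpxchg macro.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Atomically: if *ptr == expected, store desired. Returns the old value
 * of *ptr either way, like the cmpxchg described above.
 */
static uintptr_t sketch_cmpxchg(_Atomic uintptr_t *ptr, uintptr_t expected,
				uintptr_t desired)
{
	assert(ptr != NULL);
	/* On failure, C11 writes the value actually found into expected. */
	atomic_compare_exchange_strong(ptr, &expected, desired);
	return expected;
}

/* Fast lock: succeeds only if the owner word is exactly 0 (no owner, no flags). */
static int sketch_fast_lock(_Atomic uintptr_t *owner, uintptr_t me)
{
	return sketch_cmpxchg(owner, 0, me) == 0;
}

/* Fast unlock: succeeds only if the owner word is exactly me (no flag bits set). */
static int sketch_fast_unlock(_Atomic uintptr_t *owner, uintptr_t me)
{
	return sketch_cmpxchg(owner, me, 0) == me;
}
```

Note how a flag bit set in the owner word makes the comparison in
sketch_fast_unlock fail, so the caller would have to fall back to a slow
path. That is the interaction between the flags and cmpxchg that the text
refers to.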
356 | |||
357 | |||
358 | Priority adjustments | ||
359 | -------------------- | ||
360 | |||
361 | The implementation of the PI code in rtmutex.c has several places that a | ||
362 | process must adjust its priority. With the help of the pi_list of a | ||
363 | process this is rather easy to know what needs to be adjusted. | ||
364 | |||
365 | The functions implementing the task adjustments are rt_mutex_adjust_prio, | ||
366 | __rt_mutex_adjust_prio (same as the former, but expects the task pi_lock | ||
367 | to already be taken), rt_mutex_getprio, and rt_mutex_setprio. | ||
368 | |||
369 | rt_mutex_getprio and rt_mutex_setprio are only used in __rt_mutex_adjust_prio. | ||
370 | |||
371 | rt_mutex_getprio returns the priority that the task should have. Either the | ||
372 | task's own normal priority, or if a process of a higher priority is waiting on | ||
373 | a mutex owned by the task, then that higher priority should be returned. | ||
374 | Since the pi_list of a task holds a priority-ordered list of all the top | ||
375 | waiters of all the mutexes that the task owns, rt_mutex_getprio simply needs | ||
376 | to compare the top pi waiter to its own normal priority, and return the higher | ||
377 | priority back. | ||
378 | |||
379 | (Note: if looking at the code, you will notice that the lower number of | ||
380 | prio is returned. This is because the prio field in the task structure | ||
381 | is an inverse order of the actual priority. So a "prio" of 5 is | ||
382 | of higher priority than a "prio" of 10.) | ||
383 | |||
384 | __rt_mutex_adjust_prio examines the result of rt_mutex_getprio, and if the | ||
385 | result does not equal the task's current priority, then rt_mutex_setprio | ||
386 | is called to adjust the priority of the task to the new priority. | ||
387 | Note that rt_mutex_setprio is defined in kernel/sched.c to implement the | ||
388 | actual change in priority. | ||
389 | |||
390 | It is interesting to note that __rt_mutex_adjust_prio can either increase | ||
391 | or decrease the priority of the task. In the case that a higher priority | ||
392 | process has just blocked on a mutex owned by the task, __rt_mutex_adjust_prio | ||
393 | would increase/boost the task's priority. But if a higher priority task | ||
394 | were for some reason to leave the mutex (timeout or signal), this same function | ||
395 | would decrease/unboost the priority of the task. That is because the pi_list | ||
396 | always contains the highest priority task that is waiting on a mutex owned | ||
397 | by the task, so we only need to compare the priority of that top pi waiter | ||
398 | to the normal priority of the given task. | ||
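
The comparison described in this section is small enough to sketch
directly. The structures and sketch_* functions below are hypothetical
simplifications: the top of the pi_list is modelled as a single pointer,
and a plain assignment stands in for rt_mutex_setprio. As in the kernel,
a lower prio number means a higher priority.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the waiter at the top of a task's pi_list. */
struct sketch_pi_waiter {
	int prio;
};

struct sketch_task {
	int normal_prio;			/* the task's own priority */
	int prio;				/* current, possibly boosted, priority */
	struct sketch_pi_waiter *top_pi_waiter;	/* NULL if the pi_list is empty */
};

/* Return the priority the task should have (the lower number wins). */
static int sketch_getprio(const struct sketch_task *task)
{
	assert(task != NULL);
	if (task->top_pi_waiter == NULL)
		return task->normal_prio;
	if (task->top_pi_waiter->prio < task->normal_prio)
		return task->top_pi_waiter->prio;
	return task->normal_prio;
}

/* Boost or unboost: the same code path handles both directions. */
static void sketch_adjust_prio(struct sketch_task *task)
{
	int prio = sketch_getprio(task);

	if (task->prio != prio)
		task->prio = prio;	/* stand-in for rt_mutex_setprio */
}
```

Setting top_pi_waiter to a waiter of prio 5 on a task with normal_prio 10
boosts the task to 5; clearing it again (for example after a timeout)
drops the task back to 10.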
399 | |||
400 | |||
401 | High level overview of the PI chain walk | ||
402 | ---------------------------------------- | ||
403 | |||
404 | The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain. | ||
405 | |||
406 | The implementation has gone through several iterations, and has ended up | ||
407 | with what we believe is the best. It walks the PI chain by only grabbing | ||
408 | at most two locks at a time, and is very efficient. | ||
409 | |||
410 | The rt_mutex_adjust_prio_chain can be used either to boost or lower process | ||
411 | priorities. | ||
412 | |||
413 | rt_mutex_adjust_prio_chain is called with a task to be checked for PI | ||
414 | (de)boosting (the owner of a mutex that a process is blocking on), a flag to | ||
415 | check for deadlocking, the mutex that the task owns, and a pointer to a waiter | ||
416 | that is the process's waiter struct that is blocked on the mutex (although this | ||
417 | parameter may be NULL for deboosting). | ||
418 | |||
419 | For this explanation, I will not mention deadlock detection. This explanation | ||
420 | will try to stay at a high level. | ||
421 | |||
422 | When this function is called, there are no locks held. That also means | ||
423 | that the state of the owner and lock can change once this function is entered. | ||
424 | |||
425 | Before this function is called, the task has already had rt_mutex_adjust_prio | ||
426 | performed on it. This means that the task is set to the priority that it | ||
427 | should be at, but the plist nodes of the task's waiter have not been updated | ||
428 | with the new priorities, and that this task may not be in the proper locations | ||
429 | in the pi_lists and wait_lists that the task is blocked on. This function | ||
430 | solves all that. | ||
431 | |||
432 | A loop is entered, where task is the owner to be checked for PI changes that | ||
433 | was passed by parameter (for the first iteration). The pi_lock of this task is | ||
434 | taken to prevent any more changes to the pi_list of the task. This also | ||
435 | prevents new tasks from completing the blocking on a mutex that is owned by this | ||
436 | task. | ||
437 | |||
438 | If the task is not blocked on a mutex then the loop is exited. We are at | ||
439 | the top of the PI chain. | ||
440 | |||
441 | A check is now done to see if the original waiter (the process that is blocked | ||
442 | on the current mutex) is the top pi waiter of the task. That is, is this | ||
443 | waiter on the top of the task's pi_list. If it is not, it either means that | ||
444 | there is another process higher in priority that is blocked on one of the | ||
445 | mutexes that the task owns, or that the waiter has just woken up via a signal | ||
446 | or timeout and has left the PI chain. In either case, the loop is exited, since | ||
447 | we don't need to do any more changes to the priority of the current task, or any | ||
448 | task that owns a mutex that this current task is waiting on. A priority chain | ||
449 | walk is only needed when a new top pi waiter is made to a task. | ||
450 | |||
451 | The next check sees if the task's waiter plist node has the priority equal to | ||
452 | the priority the task is set at. If they are equal, then we are done with | ||
453 | the loop. Remember that the function started with the priority of the | ||
454 | task adjusted, but the plist nodes that hold the task in other processes' | ||
455 | pi_lists have not been adjusted. | ||
456 | |||
457 | Next, we look at the mutex that the task is blocked on. The mutex's wait_lock | ||
458 | is taken. This is done by a spin_trylock, because the locking order of the | ||
459 | pi_lock and wait_lock goes in the opposite direction. If we fail to grab the | ||
460 | lock, the pi_lock is released, and we restart the loop. | ||
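
The trylock-and-back-off pattern used here can be sketched in user space
with C11 atomic flags acting as spin locks. Everything named sketch_* is
hypothetical, and the max_tries bound is an assumption added only so the
sketch terminates when run in a single thread; the loop described above
simply restarts.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal spin lock built on a C11 atomic_flag. */
struct sketch_lock {
	atomic_flag held;
};

static bool sketch_trylock(struct sketch_lock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void sketch_spin_lock(struct sketch_lock *l)
{
	while (!sketch_trylock(l))
		;	/* spin until the lock is free */
}

static void sketch_spin_unlock(struct sketch_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Take pi_lock, then only *try* wait_lock, because other code takes the
 * two locks in the opposite order. On failure drop pi_lock and retry.
 * Returns the attempt number on success (both locks held), or 0 if
 * max_tries attempts all failed (no locks held).
 */
static int sketch_lock_both(struct sketch_lock *pi_lock,
			    struct sketch_lock *wait_lock, int max_tries)
{
	assert(max_tries > 0);
	for (int tries = 1; tries <= max_tries; tries++) {
		sketch_spin_lock(pi_lock);
		if (sketch_trylock(wait_lock))
			return tries;
		/* Back off so the holder of wait_lock can take pi_lock. */
		sketch_spin_unlock(pi_lock);
	}
	return 0;
}
```

Spinning on wait_lock while holding pi_lock could deadlock against a
thread that holds wait_lock and spins on pi_lock; releasing pi_lock
before retrying avoids that.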
461 | |||
462 | Now that we have both the pi_lock of the task as well as the wait_lock of | ||
463 | the mutex the task is blocked on, we update the task's waiter's plist node | ||
464 | that is located on the mutex's wait_list. | ||
465 | |||
466 | Now we release the pi_lock of the task. | ||
467 | |||
468 | Next the owner of the mutex has its pi_lock taken, so we can update the | ||
469 | task's entry in the owner's pi_list. If the task is the highest priority | ||
470 | process on the mutex's wait_list, then we remove the previous top waiter | ||
471 | from the owner's pi_list, and replace it with the task. | ||
472 | |||
473 | Note: It is possible that the task was the current top waiter on the mutex, | ||
474 | in which case the task is not yet on the pi_list of the waiter. This | ||
475 | is OK, since plist_del does nothing if the plist node is not on any | ||
476 | list. | ||
477 | |||
478 | If the task was not the top waiter of the mutex, but it was before we | ||
479 | did the priority updates, that means we are deboosting/lowering the | ||
480 | task. In this case, the task is removed from the pi_list of the owner, | ||
481 | and the new top waiter is added. | ||
482 | |||
483 | Lastly, we unlock both the pi_lock of the task, as well as the mutex's | ||
484 | wait_lock, and continue the loop again. On the next iteration of the | ||
485 | loop, the previous owner of the mutex will be the task that will be | ||
486 | processed. | ||
487 | |||
488 | Note: One might think that the owner of this mutex might have changed | ||
489 | since we just grab the mutex's wait_lock. And one could be right. | ||
490 | The important thing to remember is that the owner could not have | ||
491 | become the task that is being processed in the PI chain, since | ||
492 | we have taken that task's pi_lock at the beginning of the loop. | ||
493 | So as long as there is an owner of this mutex that is not the same | ||
494 | process as the task being worked on, we are OK. | ||
495 | |||
496 | Looking closely at the code, one might be confused. The check for the | ||
497 | end of the PI chain is when the task isn't blocked on anything or the | ||
498 | task's waiter structure "task" element is NULL. This check is | ||
499 | protected only by the task's pi_lock. But the code to unlock the mutex | ||
500 | sets the task's waiter structure "task" element to NULL with only | ||
501 | the protection of the mutex's wait_lock, which was not taken yet. | ||
502 | Isn't this a race condition if the task becomes the new owner? | ||
503 | |||
504 | The answer is No! The trick is the spin_trylock of the mutex's | ||
505 | wait_lock. If we fail that lock, we release the pi_lock of the | ||
506 | task and continue the loop, doing the end of PI chain check again. | ||
507 | |||
508 | In the code to release the lock, the wait_lock of the mutex is held | ||
509 | the entire time, and it is not let go when we grab the pi_lock of the | ||
510 | new owner of the mutex. So if the switch of a new owner were to happen | ||
511 | after the check for end of the PI chain and the grabbing of the | ||
512 | wait_lock, the unlocking code would spin on the new owner's pi_lock | ||
513 | but never give up the wait_lock. So the PI chain loop is guaranteed to | ||
514 | fail the spin_trylock on the wait_lock, release the pi_lock, and | ||
515 | try again. | ||
516 | |||
517 | If you don't quite understand the above, that's OK. You don't have to, | ||
518 | unless you really want to make a proof out of it ;) | ||
519 | |||
520 | |||
521 | Pending Owners and Lock stealing | ||
522 | -------------------------------- | ||
523 | |||
524 | One of the flags in the owner field of the mutex structure is "Pending Owner". | ||
525 | What this means is that an owner was chosen by the process releasing the | ||
526 | mutex, but that owner has yet to wake up and actually take the mutex. | ||
527 | |||
528 | Why is this important? Why can't we just give the mutex to another process | ||
529 | and be done with it? | ||
530 | |||
531 | The PI code is to help with real-time processes, and to let the highest | ||
532 | priority process run as long as possible with minimal latency and delay. | ||
533 | If a high priority process owns a mutex that a lower priority process is | ||
534 | blocked on, when the mutex is released it would be given to the lower priority | ||
535 | process. What if the higher priority process wants to take that mutex again? | ||
536 | The high priority process would fail to take that mutex that it just gave up | ||
537 | and it would need to boost the lower priority process to run with full | ||
538 | latency of that critical section (since the low priority process just entered | ||
539 | it). | ||
540 | |||
541 | There's no reason a high priority process that gives up a mutex should be | ||
542 | penalized if it tries to take that mutex again. If the new owner of the | ||
543 | mutex has not woken up yet, there's no reason that the higher priority process | ||
544 | could not take that mutex away. | ||
545 | |||
546 | To solve this, we introduced Pending Ownership and Lock Stealing. When a | ||
547 | new process is given a mutex that it was blocked on, it is only given | ||
548 | pending ownership. This means that it's the new owner, unless a higher | ||
549 | priority process comes in and tries to grab that mutex. If a higher priority | ||
550 | process does come along and wants that mutex, we let the higher priority | ||
551 | process "steal" the mutex from the pending owner (only if it is still pending) | ||
552 | and continue with the mutex. | ||
553 | |||
554 | |||
555 | Taking of a mutex (The walk through) | ||
556 | ------------------------------------ | ||
557 | |||
558 | OK, now let's take a look at the detailed walk through of what happens when | ||
559 | taking a mutex. | ||
560 | |||
561 | The first thing that is tried is the fast taking of the mutex. This is | ||
562 | done when we have CMPXCHG enabled (otherwise the fast taking automatically | ||
563 | fails). Only when the owner field of the mutex is NULL can the lock be | ||
564 | taken with the CMPXCHG and nothing else needs to be done. | ||
565 | |||
566 | If there is contention on the lock, whether it has an owner or a pending owner, | ||
567 | we take the slow path (rt_mutex_slowlock). | ||
568 | |||
569 | The slow path function is where the task's waiter structure is created on | ||
570 | the stack. This is because the waiter structure is only needed for the | ||
571 | scope of this function. The waiter structure holds the nodes to store | ||
572 | the task on the wait_list of the mutex, and if need be, the pi_list of | ||
573 | the owner. | ||
574 | |||
575 | The wait_lock of the mutex is taken since the slow path of unlocking the | ||
576 | mutex also takes this lock. | ||
577 | |||
578 | We then call try_to_take_rt_mutex. This is where the architecture that | ||
579 | does not implement CMPXCHG would always grab the lock (if there's no | ||
580 | contention). | ||
581 | |||
582 | try_to_take_rt_mutex is used every time the task tries to grab a mutex in the | ||
583 | slow path. The first thing that is done here is an atomic setting of | ||
584 | the "Has Waiters" flag of the mutex's owner field. Yes, this could really | ||
585 | be false, because if the mutex has no owner, there are no waiters and | ||
586 | the current task also won't have any waiters. But we don't have the lock | ||
587 | yet, so we assume we are going to be a waiter. The reason for this is to | ||
588 | play nice for those architectures that do have CMPXCHG. By setting this flag | ||
589 | now, the owner of the mutex can't release the mutex without going into the | ||
590 | slow unlock path, and it would then need to grab the wait_lock, which this | ||
591 | code currently holds. So setting the "Has Waiters" flag forces the owner | ||
592 | to synchronize with this code. | ||
593 | |||
594 | Now that we know that we can't have any races with the owner releasing the | ||
595 | mutex, we check to see if we can take the ownership. This is done if the | ||
596 | mutex doesn't have an owner, or if we can steal the mutex from a pending | ||
597 | owner. Let's look at the situations we have here. | ||
598 | |||
599 | 1) Has owner that is pending | ||
600 | ---------------------------- | ||
601 | |||
602 | The mutex has an owner, but it hasn't woken up and the mutex flag | ||
603 | "Pending Owner" is set. The first check is to see if the owner isn't the | ||
604 | current task. This is because this function is also used for the pending | ||
605 | owner to grab the mutex. When a pending owner wakes up, it checks to see | ||
606 | if it can take the mutex, and this is done if the owner is already set to | ||
607 | itself. If so, we succeed and leave the function, clearing the "Pending | ||
608 | Owner" bit. | ||
609 | |||
610 | If the pending owner is not current, we check to see if current's priority is | ||
611 | higher than the pending owner's. If not, we fail the function and return. | ||
612 | |||
613 | There's also something special about a pending owner. That is, a pending owner | ||
614 | is never blocked on a mutex. So there is no PI chain to worry about. It also | ||
615 | means that if the mutex doesn't have any waiters, there's no accounting needed | ||
616 | to update the pending owner's pi_list, since we only worry about processes | ||
617 | blocked on the current mutex. | ||
618 | |||
619 | If there are waiters on this mutex, and we just stole the ownership, we need | ||
620 | to take the top waiter, remove it from the pi_list of the pending owner, and | ||
621 | add it to the current task's pi_list. Note that at this moment, the pending owner | ||
622 | is no longer on the list of waiters. This is fine, since the pending owner | ||
623 | would add itself back when it realizes that it had the ownership stolen | ||
624 | from itself. When the pending owner tries to grab the mutex, it will fail | ||
625 | in try_to_take_rt_mutex if the owner field points to another process. | ||
626 | |||
627 | 2) No owner | ||
628 | ----------- | ||
629 | |||
630 | If there is no owner (or we successfully stole the lock), we set the owner | ||
631 | of the mutex to current, and set the flag of "Has Waiters" if the current | ||
632 | mutex actually has waiters, or we clear the flag if it doesn't. See, it was | ||
633 | OK that we set that flag early, since now it is cleared. | ||
634 | |||
635 | 3) Failed to grab ownership | ||
636 | --------------------------- | ||
637 | |||
638 | The most interesting case is when we fail to take ownership. This means that | ||
639 | there exists an owner, or there's a pending owner with equal or higher | ||
640 | priority than the current task. | ||
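The three situations above can be sketched as a single decision function. This is a simplified model under stated assumptions: the struct layout and the name model_try_to_take are hypothetical, a lower prio value is assumed to mean a higher priority, and moving the top waiter between pi_lists when stealing is left out.

```c
#include <assert.h>
#include <stddef.h>

struct task { int prio; };	/* lower value = higher priority (assumed) */

struct mutex_model {
	struct task *owner;	/* NULL when the mutex has no owner */
	int pending;		/* models the "Pending Owner" flag */
	int has_waiters;	/* models the "Has Waiters" flag */
	int nr_waiters;		/* tasks still queued on the mutex */
};

/* Returns 1 if 'current' now owns the mutex, 0 if it has to block.
 * The caller is assumed to hold the wait_lock. */
static int model_try_to_take(struct mutex_model *m, struct task *current)
{
	assert(current != NULL);

	/* Assume we will become a waiter; corrected below on success. */
	m->has_waiters = 1;

	if (m->owner && m->owner != current) {
		if (!m->pending)
			return 0;	/* a real owner cannot be robbed */
		if (current->prio >= m->owner->prio)
			return 0;	/* pending owner is equal or higher */
		/* otherwise steal the mutex from the pending owner */
	}

	/* No owner, a successful steal, or the pending owner itself. */
	m->owner = current;
	m->pending = 0;
	m->has_waiters = m->nr_waiters > 0;
	return 1;
}
```

On failure the flag stays set, which matches the text: the caller is about to become a waiter, so the owner has to go through the slow unlock path.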
641 | |||
642 | We'll continue on the failed case. | ||
643 | |||
644 | If the mutex has a timeout, we set up a timer to go off to break us out | ||
645 | of this mutex if we failed to get it after a specified amount of time. | ||
646 | |||
647 | Now we enter a loop that will continue to try to take ownership of the mutex, or | ||
648 | fail from a timeout or signal. | ||
649 | |||
650 | Once again we try to take the mutex. This will usually fail the first time | ||
651 | in the loop, since it had just failed to get the mutex. But the second time | ||
652 | in the loop, this would likely succeed, since the task would likely be | ||
653 | the pending owner. | ||
654 | |||
655 | If the mutex is being taken in the TASK_INTERRUPTIBLE state, a check for | ||
656 | signals and timeout is done here. | ||
657 | |||
658 | The waiter structure has a "task" field that points to the task that is blocked | ||
659 | on the mutex. This field can be NULL the first time it goes through the loop | ||
660 | or if the task is a pending owner and had its mutex stolen. If the "task" | ||
661 | field is NULL then we need to set up the accounting for it. | ||
662 | |||
663 | Task blocks on mutex | ||
664 | -------------------- | ||
665 | |||
666 | The accounting of a mutex and process is done with the waiter structure of | ||
667 | the process. The "task" field is set to the process, and the "lock" field | ||
668 | to the mutex. The plist nodes are initialized to the process's current | ||
669 | priority. | ||
670 | |||
671 | Since the wait_lock was taken at the entry of the slow lock, we can safely | ||
672 | add the waiter to the wait_list. If the current process is the highest | ||
673 | priority process currently waiting on this mutex, then we remove the | ||
674 | previous top waiter process (if it exists) from the pi_list of the owner, | ||
675 | and add the current process to that list. Since the pi_list of the owner | ||
676 | has changed, we call rt_mutex_adjust_prio on the owner to see if the owner | ||
677 | should adjust its priority accordingly. | ||
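A minimal sketch of this accounting, assuming a single mutex and a fixed-size array in place of the plist: the waiter is queued in priority order, and if it became the top waiter the owner's priority is recomputed. All names here are hypothetical, a lower prio value is assumed to mean a higher priority, and the chain walk for an owner that is itself blocked is not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8

struct task {
	int normal_prio;	/* priority without inheritance */
	int prio;		/* effective priority; lower = higher (assumed) */
};

struct mutex_model {
	struct task *owner;
	struct task *waiters[MAX_WAITERS];	/* sorted, top waiter first */
	int nr_waiters;
};

/* Recompute the owner's effective priority from the top waiter.
 * A real pi_list can hold top waiters of several mutexes; this
 * model only knows about one. */
static void model_adjust_prio(struct mutex_model *m)
{
	struct task *o = m->owner;

	o->prio = o->normal_prio;
	if (m->nr_waiters > 0 && m->waiters[0]->prio < o->prio)
		o->prio = m->waiters[0]->prio;
}

/* Queue 'w' in priority order (FIFO among equal priorities).
 * Returns 1 if 'w' became the top waiter, 0 otherwise. */
static int model_block_on(struct mutex_model *m, struct task *w)
{
	int i;

	assert(m->nr_waiters < MAX_WAITERS);
	i = m->nr_waiters++;
	while (i > 0 && m->waiters[i - 1]->prio > w->prio) {
		m->waiters[i] = m->waiters[i - 1];
		i--;
	}
	m->waiters[i] = w;

	if (i == 0) {
		/* The top waiter changed, so the owner may need a boost. */
		model_adjust_prio(m);
		return 1;
	}
	return 0;
}
```

Only a change of the top waiter touches the owner here, which mirrors the text: lower priority waiters are added to the wait_list without any priority adjustment.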
678 | |||
679 | If the owner is also blocked on a lock, and had its pi_list changed | ||
680 | (or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead | ||
681 | and run rt_mutex_adjust_prio_chain on the owner, as described earlier. | ||
682 | |||
683 | Now all locks are released, and if the current process is still blocked on a | ||
684 | mutex (waiter "task" field is not NULL), then we go to sleep (call schedule). | ||
685 | |||
686 | Waking up in the loop | ||
687 | --------------------- | ||
688 | |||
689 | The task can then wake up from schedule for a few reasons. | ||
690 | 1) we were given pending ownership of the mutex. | ||
691 | 2) we received a signal and were TASK_INTERRUPTIBLE | ||
692 | 3) we had a timeout and were TASK_INTERRUPTIBLE | ||
693 | |||
694 | In any of these cases, we continue the loop and once again try to grab the | ||
695 | ownership of the mutex. If we succeed, we exit the loop. Otherwise, on a signal | ||
696 | or timeout we exit the loop, and if we had the mutex stolen we simply add | ||
697 | ourselves back on the lists and go back to sleep. | ||
698 | |||
699 | Note: Because of timeouts and signals, the steal mutex algorithm needs to | ||
700 | be careful. A process woken by a timeout or signal is still on the | ||
701 | wait_list when it retries. And because of dynamic changing of priorities, | ||
702 | especially on SCHED_OTHER tasks, the current process can be the | ||
703 | highest priority task on the wait_list. | ||
704 | |||
705 | Failed to get mutex on Timeout or Signal | ||
706 | ---------------------------------------- | ||
707 | |||
708 | If a timeout or signal occurred, the waiter's "task" field would not be | ||
709 | NULL and the task needs to be taken off the wait_list of the mutex and perhaps | ||
710 | pi_list of the owner. If this process was a high priority process, then | ||
711 | the rt_mutex_adjust_prio_chain needs to be executed again on the owner, | ||
712 | but this time it will be lowering the priorities. | ||
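The lowering of the owner's priority can be illustrated with a small model that tracks only waiter priorities. The struct and function names are hypothetical, a lower prio value is assumed to mean a higher priority, and the owner is assumed to hold just this one mutex and not be blocked itself.

```c
#include <assert.h>

#define MAX_WAITERS 8

struct owner_model {
	int normal_prio;		/* priority without inheritance */
	int prio;			/* effective priority; lower = higher */
	int waiter_prio[MAX_WAITERS];	/* sorted, top waiter first */
	int nr_waiters;
};

/* Recompute the effective priority from whatever top waiter is left. */
static void model_recompute(struct owner_model *o)
{
	o->prio = o->normal_prio;
	if (o->nr_waiters > 0 && o->waiter_prio[0] < o->prio)
		o->prio = o->waiter_prio[0];
}

/* Drop the waiter at 'idx' after a timeout or signal. Only the loss
 * of the top waiter can change the owner's priority. */
static void model_remove_waiter(struct owner_model *o, int idx)
{
	int i;

	assert(idx >= 0 && idx < o->nr_waiters);
	for (i = idx; i < o->nr_waiters - 1; i++)
		o->waiter_prio[i] = o->waiter_prio[i + 1];
	o->nr_waiters--;

	if (idx == 0)
		model_recompute(o);
}
```

When the last waiter leaves, the owner falls back to its normal priority, which is the "lowering" direction of the chain walk described above.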
713 | |||
714 | |||
715 | Unlocking the Mutex | ||
716 | ------------------- | ||
717 | |||
718 | The unlocking of a mutex also has a fast path for those architectures with | ||
719 | CMPXCHG. Since the taking of a mutex on contention always sets the | ||
720 | "Has Waiters" flag in the mutex's owner field, we use this to know if we need to | ||
721 | take the slow path when unlocking the mutex. If the mutex doesn't have any | ||
722 | waiters, the owner field of the mutex would equal the current process and | ||
723 | the mutex can be unlocked by just replacing the owner field with NULL. | ||
724 | |||
725 | If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available), | ||
726 | the slow unlock path is taken. | ||
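The fast unlock path can be sketched in user space with a C11 compare-and-exchange standing in for CMPXCHG. The name model_fast_unlock and the HAS_WAITERS bit position are assumptions for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define HAS_WAITERS 2UL		/* assumed bit position */

struct mutex_model {
	_Atomic uintptr_t owner;	/* owner pointer plus flag bits */
};

/* Try to release the mutex by swapping the owner word from the bare
 * pointer 'me' to 0. Returns 1 on success, 0 if the word did not
 * match (for example because "Has Waiters" is set) and the slow
 * path has to be taken. */
static int model_fast_unlock(struct mutex_model *m, uintptr_t me)
{
	uintptr_t expected = me;

	assert((me & HAS_WAITERS) == 0);
	return atomic_compare_exchange_strong(&m->owner, &expected,
					      (uintptr_t)0);
}
```

A failed exchange leaves the owner word untouched, so the slow path starts from the same state the waiter left behind.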
727 | |||
728 | The first thing done in the slow unlock path is to take the wait_lock of the | ||
729 | mutex. This synchronizes the locking and unlocking of the mutex. | ||
730 | |||
731 | A check is made to see if the mutex has waiters or not. On architectures that | ||
732 | do not have CMPXCHG, this is the location that the owner of the mutex will | ||
733 | determine if a waiter needs to be awoken or not. On architectures that | ||
734 | do have CMPXCHG, that check is done in the fast path, but it is still needed | ||
735 | in the slow path too. If a waiter of a mutex woke up because of a signal | ||
736 | or timeout between the time the owner failed the fast path CMPXCHG check and | ||
737 | the grabbing of the wait_lock, the mutex may not have any waiters, thus the | ||
738 | owner still needs to make this check. If there are no waiters then the mutex | ||
739 | owner field is set to NULL, the wait_lock is released and nothing more is | ||
740 | needed. | ||
741 | |||
742 | If there are waiters, then we need to wake one up and give that waiter | ||
743 | pending ownership. | ||
744 | |||
745 | In the wake up code, the pi_lock of the current owner is taken. The top | ||
746 | waiter of the lock is found and removed from the wait_list of the mutex | ||
747 | as well as the pi_list of the current owner. The task field of the new | ||
748 | pending owner's waiter structure is set to NULL, and the owner field of the | ||
749 | mutex is set to the new owner with the "Pending Owner" bit set, as well | ||
750 | as the "Has Waiters" bit if there still are other processes blocked on the | ||
751 | mutex. | ||
752 | |||
753 | The pi_lock of the previous owner is released, and the new pending owner's | ||
754 | pi_lock is taken. Remember that this is the trick to prevent the race | ||
755 | condition in rt_mutex_adjust_prio_chain from adding itself as a waiter | ||
756 | on the mutex. | ||
757 | |||
758 | We now clear the "pi_blocked_on" field of the new pending owner, and if | ||
759 | the mutex still has waiters pending, we add the new top waiter to the pi_list | ||
760 | of the pending owner. | ||
761 | |||
762 | Finally we unlock the pi_lock of the pending owner and wake it up. | ||
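The hand-off in the slow unlock path can be sketched as follows. This is a simplified model: the wait_lock is assumed to be held, the pi_lock and pi_list steps are left out, and the names and flag bit positions are assumptions rather than the identifiers used in rtmutex.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PENDING_OWNER	1UL	/* assumed bit positions */
#define HAS_WAITERS	2UL
#define MAX_WAITERS	8

struct task { int prio; };

struct waiter {
	struct task *task;	/* NULL once the waiter is no longer blocked */
};

struct mutex_model {
	uintptr_t owner;			/* task pointer plus flag bits */
	struct waiter *waiters[MAX_WAITERS];	/* sorted, top waiter first */
	int nr_waiters;
};

/* Slow unlock with the wait_lock assumed held. Returns the task that
 * should be woken, or NULL if nobody was waiting. */
static struct task *model_slow_unlock(struct mutex_model *m)
{
	struct waiter *top;
	struct task *next;
	int i;

	if (m->nr_waiters == 0) {
		m->owner = 0;		/* the waiter left before we got here */
		return NULL;
	}

	/* Take the top waiter off the wait list. */
	top = m->waiters[0];
	for (i = 1; i < m->nr_waiters; i++)
		m->waiters[i - 1] = m->waiters[i];
	m->nr_waiters--;

	/* Give it pending ownership and mark it as no longer blocked. */
	assert(top->task != NULL);
	next = top->task;
	top->task = NULL;
	m->owner = (uintptr_t)next | PENDING_OWNER;
	if (m->nr_waiters > 0)
		m->owner |= HAS_WAITERS;
	return next;
}
```

The returned task is only a pending owner in this model. It still has to run and take the mutex itself, and until then a higher priority task may steal it.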
763 | |||
764 | |||
765 | Contact | ||
766 | ------- | ||
767 | |||
768 | For updates on this document, please email Steven Rostedt <rostedt@goodmis.org> | ||
769 | |||
770 | |||
771 | Credits | ||
772 | ------- | ||
773 | |||
774 | Author: Steven Rostedt <rostedt@goodmis.org> | ||
775 | |||
776 | Reviewers: Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and Randy Dunlap | ||
777 | |||
778 | Updates | ||
779 | ------- | ||
780 | |||
781 | This document was originally written for 2.6.17-rc3-mm1 | ||