diff options
Diffstat (limited to 'Documentation/filesystems/directory-locking')
-rw-r--r-- | Documentation/filesystems/directory-locking | 113 |
1 files changed, 113 insertions, 0 deletions
diff --git a/Documentation/filesystems/directory-locking b/Documentation/filesystems/directory-locking new file mode 100644 index 000000000000..34380d4fbce3 --- /dev/null +++ b/Documentation/filesystems/directory-locking | |||
@@ -0,0 +1,113 @@ | |||
1 | Locking scheme used for directory operations is based on two | ||
2 | kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem). | ||
3 | |||
4 | For our purposes all operations fall in 5 classes: | ||
5 | |||
6 | 1) read access. Locking rules: caller locks directory we are accessing. | ||
7 | |||
8 | 2) object creation. Locking rules: same as above. | ||
9 | |||
10 | 3) object removal. Locking rules: caller locks parent, finds victim, | ||
11 | locks victim and calls the method. | ||
12 | |||
13 | 4) rename() that is _not_ cross-directory. Locking rules: caller locks | ||
14 | the parent, finds source and target, if target already exists - locks it | ||
15 | and then calls the method. | ||
16 | |||
17 | 5) link creation. Locking rules: | ||
18 | * lock parent | ||
19 | * check that source is not a directory | ||
20 | * lock source | ||
21 | * call the method. | ||
22 | |||
23 | 6) cross-directory rename. The trickiest in the whole bunch. Locking | ||
24 | rules: | ||
25 | * lock the filesystem | ||
26 | * lock parents in "ancestors first" order. | ||
27 | * find source and target. | ||
28 | * if old parent is equal to or is a descendent of target | ||
29 | fail with -ENOTEMPTY | ||
30 | * if new parent is equal to or is a descendent of source | ||
31 | fail with -ELOOP | ||
32 | * if target exists - lock it. | ||
33 | * call the method. | ||
34 | |||
35 | |||
36 | The rules above obviously guarantee that all directories that are going to be | ||
37 | read, modified or removed by method will be locked by caller. | ||
38 | |||
39 | |||
40 | If no directory is its own ancestor, the scheme above is deadlock-free. | ||
41 | Proof: | ||
42 | |||
43 | First of all, at any moment we have a partial ordering of the | ||
44 | objects - A < B iff A is an ancestor of B. | ||
45 | |||
46 | That ordering can change. However, the following is true: | ||
47 | |||
48 | (1) if object removal or non-cross-directory rename holds lock on A and | ||
49 | attempts to acquire lock on B, A will remain the parent of B until we | ||
50 | acquire the lock on B. (Proof: only cross-directory rename can change | ||
51 | the parent of object and it would have to lock the parent). | ||
52 | |||
53 | (2) if cross-directory rename holds the lock on filesystem, order will not | ||
54 | change until rename acquires all locks. (Proof: other cross-directory | ||
55 | renames will be blocked on filesystem lock and we don't start changing | ||
56 | the order until we had acquired all locks). | ||
57 | |||
58 | (3) any operation holds at most one lock on non-directory object and | ||
59 | that lock is acquired after all other locks. (Proof: see descriptions | ||
60 | of operations). | ||
61 | |||
62 | Now consider the minimal deadlock. Each process is blocked on | ||
63 | attempt to acquire some lock and already holds at least one lock. Let's | ||
64 | consider the set of contended locks. First of all, filesystem lock is | ||
65 | not contended, since any process blocked on it is not holding any locks. | ||
66 | Thus all processes are blocked on ->i_sem. | ||
67 | |||
68 | Non-directory objects are not contended due to (3). Thus link | ||
69 | creation can't be a part of deadlock - it can't be blocked on source | ||
70 | and it means that it doesn't hold any locks. | ||
71 | |||
72 | Any contended object is either held by cross-directory rename or | ||
73 | has a child that is also contended. Indeed, suppose that it is held by | ||
74 | operation other than cross-directory rename. Then the lock this operation | ||
75 | is blocked on belongs to child of that object due to (1). | ||
76 | |||
77 | It means that one of the operations is cross-directory rename. | ||
78 | Otherwise the set of contended objects would be infinite - each of them | ||
79 | would have a contended child and we had assumed that no object is its | ||
80 | own descendent. Moreover, there is exactly one cross-directory rename | ||
81 | (see above). | ||
82 | |||
83 | Consider the object blocking the cross-directory rename. One | ||
84 | of its descendents is locked by cross-directory rename (otherwise we | ||
85 | would again have an infinite set of of contended objects). But that | ||
86 | means that cross-directory rename is taking locks out of order. Due | ||
87 | to (2) the order hadn't changed since we had acquired filesystem lock. | ||
88 | But locking rules for cross-directory rename guarantee that we do not | ||
89 | try to acquire lock on descendent before the lock on ancestor. | ||
90 | Contradiction. I.e. deadlock is impossible. Q.E.D. | ||
91 | |||
92 | |||
93 | These operations are guaranteed to avoid loop creation. Indeed, | ||
94 | the only operation that could introduce loops is cross-directory rename. | ||
95 | Since the only new (parent, child) pair added by rename() is (new parent, | ||
96 | source), such loop would have to contain these objects and the rest of it | ||
97 | would have to exist before rename(). I.e. at the moment of loop creation | ||
98 | rename() responsible for that would be holding filesystem lock and new parent | ||
99 | would have to be equal to or a descendent of source. But that means that | ||
100 | new parent had been equal to or a descendent of source since the moment when | ||
101 | we had acquired filesystem lock and rename() would fail with -ELOOP in that | ||
102 | case. | ||
103 | |||
104 | While this locking scheme works for arbitrary DAGs, it relies on | ||
105 | ability to check that directory is a descendent of another object. Current | ||
106 | implementation assumes that directory graph is a tree. This assumption is | ||
107 | also preserved by all operations (cross-directory rename on a tree that would | ||
108 | not introduce a cycle will leave it a tree and link() fails for directories). | ||
109 | |||
110 | Notice that "directory" in the above == "anything that might have | ||
111 | children", so if we are going to introduce hybrid objects we will need | ||
112 | either to make sure that link(2) doesn't work for them or to make changes | ||
113 | in is_subdir() that would make it work even in presence of such beasts. | ||