diff options
Diffstat (limited to 'Documentation/filesystems/path-lookup.txt')
-rw-r--r-- | Documentation/filesystems/path-lookup.txt | 345 |
1 files changed, 345 insertions, 0 deletions
diff --git a/Documentation/filesystems/path-lookup.txt b/Documentation/filesystems/path-lookup.txt new file mode 100644 index 000000000000..09b2878724a1 --- /dev/null +++ b/Documentation/filesystems/path-lookup.txt | |||
@@ -0,0 +1,345 @@ | |||
1 | Path walking and name lookup locking | ||
2 | ==================================== | ||
3 | |||
4 | Path resolution is the finding a dentry corresponding to a path name string, by | ||
5 | performing a path walk. Typically, for every open(), stat() etc., the path name | ||
6 | will be resolved. Paths are resolved by walking the namespace tree, starting | ||
7 | with the first component of the pathname (eg. root or cwd) with a known dentry, | ||
8 | then finding the child of that dentry, which is named the next component in the | ||
9 | path string. Then repeating the lookup from the child dentry and finding its | ||
10 | child with the next element, and so on. | ||
11 | |||
12 | Since it is a frequent operation for workloads like multiuser environments and | ||
13 | web servers, it is important to optimize this code. | ||
14 | |||
15 | Path walking synchronisation history: | ||
16 | Prior to 2.5.10, dcache_lock was acquired in d_lookup (dcache hash lookup) and | ||
17 | thus in every component during path look-up. Since 2.5.10 onwards, fast-walk | ||
18 | algorithm changed this by holding the dcache_lock at the beginning and walking | ||
19 | as many cached path component dentries as possible. This significantly | ||
20 | decreases the number of acquisition of dcache_lock. However it also increases | ||
21 | the lock hold time significantly and affects performance in large SMP machines. | ||
22 | Since 2.5.62 kernel, dcache has been using a new locking model that uses RCU to | ||
23 | make dcache look-up lock-free. | ||
24 | |||
25 | All the above algorithms required taking a lock and reference count on the | ||
26 | dentry that was looked up, so that may be used as the basis for walking the | ||
27 | next path element. This is inefficient and unscalable. It is inefficient | ||
28 | because of the locks and atomic operations required for every dentry element | ||
29 | slows things down. It is not scalable because many parallel applications that | ||
30 | are path-walk intensive tend to do path lookups starting from a common dentry | ||
31 | (usually, the root "/" or current working directory). So contention on these | ||
32 | common path elements causes lock and cacheline queueing. | ||
33 | |||
34 | Since 2.6.38, RCU is used to make a significant part of the entire path walk | ||
35 | (including dcache look-up) completely "store-free" (so, no locks, atomics, or | ||
36 | even stores into cachelines of common dentries). This is known as "rcu-walk" | ||
37 | path walking. | ||
38 | |||
39 | Path walking overview | ||
40 | ===================== | ||
41 | |||
42 | A name string specifies a start (root directory, cwd, fd-relative) and a | ||
43 | sequence of elements (directory entry names), which together refer to a path in | ||
44 | the namespace. A path is represented as a (dentry, vfsmount) tuple. The name | ||
45 | elements are sub-strings, seperated by '/'. | ||
46 | |||
47 | Name lookups will want to find a particular path that a name string refers to | ||
48 | (usually the final element, or parent of final element). This is done by taking | ||
49 | the path given by the name's starting point (which we know in advance -- eg. | ||
50 | current->fs->cwd or current->fs->root) as the first parent of the lookup. Then | ||
51 | iteratively for each subsequent name element, look up the child of the current | ||
52 | parent with the given name and if it is not the desired entry, make it the | ||
53 | parent for the next lookup. | ||
54 | |||
55 | A parent, of course, must be a directory, and we must have appropriate | ||
56 | permissions on the parent inode to be able to walk into it. | ||
57 | |||
58 | Turning the child into a parent for the next lookup requires more checks and | ||
59 | procedures. Symlinks essentially substitute the symlink name for the target | ||
60 | name in the name string, and require some recursive path walking. Mount points | ||
61 | must be followed into (thus changing the vfsmount that subsequent path elements | ||
62 | refer to), switching from the mount point path to the root of the particular | ||
63 | mounted vfsmount. These behaviours are variously modified depending on the | ||
64 | exact path walking flags. | ||
65 | |||
66 | Path walking then must, broadly, do several particular things: | ||
67 | - find the start point of the walk; | ||
68 | - perform permissions and validity checks on inodes; | ||
69 | - perform dcache hash name lookups on (parent, name element) tuples; | ||
70 | - traverse mount points; | ||
71 | - traverse symlinks; | ||
72 | - lookup and create missing parts of the path on demand. | ||
73 | |||
74 | Safe store-free look-up of dcache hash table | ||
75 | ============================================ | ||
76 | |||
77 | Dcache name lookup | ||
78 | ------------------ | ||
79 | In order to lookup a dcache (parent, name) tuple, we take a hash on the tuple | ||
80 | and use that to select a bucket in the dcache-hash table. The list of entries | ||
81 | in that bucket is then walked, and we do a full comparison of each entry | ||
82 | against our (parent, name) tuple. | ||
83 | |||
84 | The hash lists are RCU protected, so list walking is not serialised with | ||
85 | concurrent updates (insertion, deletion from the hash). This is a standard RCU | ||
86 | list application with the exception of renames, which will be covered below. | ||
87 | |||
88 | Parent and name members of a dentry, as well as its membership in the dcache | ||
89 | hash, and its inode are protected by the per-dentry d_lock spinlock. A | ||
90 | reference is taken on the dentry (while the fields are verified under d_lock), | ||
91 | and this stabilises its d_inode pointer and actual inode. This gives a stable | ||
92 | point to perform the next step of our path walk against. | ||
93 | |||
94 | These members are also protected by d_seq seqlock, although this offers | ||
95 | read-only protection and no durability of results, so care must be taken when | ||
96 | using d_seq for synchronisation (see seqcount based lookups, below). | ||
97 | |||
98 | Renames | ||
99 | ------- | ||
100 | Back to the rename case. In usual RCU protected lists, the only operations that | ||
101 | will happen to an object is insertion, and then eventually removal from the | ||
102 | list. The object will not be reused until an RCU grace period is complete. | ||
103 | This ensures the RCU list traversal primitives can run over the object without | ||
104 | problems (see RCU documentation for how this works). | ||
105 | |||
106 | However when a dentry is renamed, its hash value can change, requiring it to be | ||
107 | moved to a new hash list. Allocating and inserting a new alias would be | ||
108 | expensive and also problematic for directory dentries. Latency would be far to | ||
109 | high to wait for a grace period after removing the dentry and before inserting | ||
110 | it in the new hash bucket. So what is done is to insert the dentry into the | ||
111 | new list immediately. | ||
112 | |||
113 | However, when the dentry's list pointers are updated to point to objects in the | ||
114 | new list before waiting for a grace period, this can result in a concurrent RCU | ||
115 | lookup of the old list veering off into the new (incorrect) list and missing | ||
116 | the remaining dentries on the list. | ||
117 | |||
118 | There is no fundamental problem with walking down the wrong list, because the | ||
119 | dentry comparisons will never match. However it is fatal to miss a matching | ||
120 | dentry. So a seqlock is used to detect when a rename has occurred, and so the | ||
121 | lookup can be retried. | ||
122 | |||
123 | 1 2 3 | ||
124 | +---+ +---+ +---+ | ||
125 | hlist-->| N-+->| N-+->| N-+-> | ||
126 | head <--+-P |<-+-P |<-+-P | | ||
127 | +---+ +---+ +---+ | ||
128 | |||
129 | Rename of dentry 2 may require it deleted from the above list, and inserted | ||
130 | into a new list. Deleting 2 gives the following list. | ||
131 | |||
132 | 1 3 | ||
133 | +---+ +---+ (don't worry, the longer pointers do not | ||
134 | hlist-->| N-+-------->| N-+-> impose a measurable performance overhead | ||
135 | head <--+-P |<--------+-P | on modern CPUs) | ||
136 | +---+ +---+ | ||
137 | ^ 2 ^ | ||
138 | | +---+ | | ||
139 | | | N-+----+ | ||
140 | +----+-P | | ||
141 | +---+ | ||
142 | |||
143 | This is a standard RCU-list deletion, which leaves the deleted object's | ||
144 | pointers intact, so a concurrent list walker that is currently looking at | ||
145 | object 2 will correctly continue to object 3 when it is time to traverse the | ||
146 | next object. | ||
147 | |||
148 | However, when inserting object 2 onto a new list, we end up with this: | ||
149 | |||
150 | 1 3 | ||
151 | +---+ +---+ | ||
152 | hlist-->| N-+-------->| N-+-> | ||
153 | head <--+-P |<--------+-P | | ||
154 | +---+ +---+ | ||
155 | 2 | ||
156 | +---+ | ||
157 | | N-+----> | ||
158 | <----+-P | | ||
159 | +---+ | ||
160 | |||
161 | Because we didn't wait for a grace period, there may be a concurrent lookup | ||
162 | still at 2. Now when it follows 2's 'next' pointer, it will walk off into | ||
163 | another list without ever having checked object 3. | ||
164 | |||
165 | A related, but distinctly different, issue is that of rename atomicity versus | ||
166 | lookup operations. If a file is renamed from 'A' to 'B', a lookup must only | ||
167 | find either 'A' or 'B'. So if a lookup of 'A' returns NULL, a subsequent lookup | ||
168 | of 'B' must succeed (note the reverse is not true). | ||
169 | |||
170 | Between deleting the dentry from the old hash list, and inserting it on the new | ||
171 | hash list, a lookup may find neither 'A' nor 'B' matching the dentry. The same | ||
172 | rename seqlock is also used to cover this race in much the same way, by | ||
173 | retrying a negative lookup result if a rename was in progress. | ||
174 | |||
175 | Seqcount based lookups | ||
176 | ---------------------- | ||
177 | In refcount based dcache lookups, d_lock is used to serialise access to | ||
178 | the dentry, stabilising it while comparing its name and parent and then | ||
179 | taking a reference count (the reference count then gives a stable place to | ||
180 | start the next part of the path walk from). | ||
181 | |||
182 | As explained above, we would like to do path walking without taking locks or | ||
183 | reference counts on intermediate dentries along the path. To do this, a per | ||
184 | dentry seqlock (d_seq) is used to take a "coherent snapshot" of what the dentry | ||
185 | looks like (its name, parent, and inode). That snapshot is then used to start | ||
186 | the next part of the path walk. When loading the coherent snapshot under d_seq, | ||
187 | care must be taken to load the members up-front, and use those pointers rather | ||
188 | than reloading from the dentry later on (otherwise we'd have interesting things | ||
189 | like d_inode going NULL underneath us, if the name was unlinked). | ||
190 | |||
191 | Also important is to avoid performing any destructive operations (pretty much: | ||
192 | no non-atomic stores to shared data), and to recheck the seqcount when we are | ||
193 | "done" with the operation. Retry or abort if the seqcount does not match. | ||
194 | Avoiding destructive or changing operations means we can easily unwind from | ||
195 | failure. | ||
196 | |||
197 | What this means is that a caller, provided they are holding RCU lock to | ||
198 | protect the dentry object from disappearing, can perform a seqcount based | ||
199 | lookup which does not increment the refcount on the dentry or write to | ||
200 | it in any way. This returned dentry can be used for subsequent operations, | ||
201 | provided that d_seq is rechecked after that operation is complete. | ||
202 | |||
203 | Inodes are also rcu freed, so the seqcount lookup dentry's inode may also be | ||
204 | queried for permissions. | ||
205 | |||
206 | With this two parts of the puzzle, we can do path lookups without taking | ||
207 | locks or refcounts on dentry elements. | ||
208 | |||
209 | RCU-walk path walking design | ||
210 | ============================ | ||
211 | |||
212 | Path walking code now has two distinct modes, ref-walk and rcu-walk. ref-walk | ||
213 | is the traditional[*] way of performing dcache lookups using d_lock to | ||
214 | serialise concurrent modifications to the dentry and take a reference count on | ||
215 | it. ref-walk is simple and obvious, and may sleep, take locks, etc while path | ||
216 | walking is operating on each dentry. rcu-walk uses seqcount based dentry | ||
217 | lookups, and can perform lookup of intermediate elements without any stores to | ||
218 | shared data in the dentry or inode. rcu-walk can not be applied to all cases, | ||
219 | eg. if the filesystem must sleep or perform non trivial operations, rcu-walk | ||
220 | must be switched to ref-walk mode. | ||
221 | |||
222 | [*] RCU is still used for the dentry hash lookup in ref-walk, but not the full | ||
223 | path walk. | ||
224 | |||
225 | Where ref-walk uses a stable, refcounted ``parent'' to walk the remaining | ||
226 | path string, rcu-walk uses a d_seq protected snapshot. When looking up a | ||
227 | child of this parent snapshot, we open d_seq critical section on the child | ||
228 | before closing d_seq critical section on the parent. This gives an interlocking | ||
229 | ladder of snapshots to walk down. | ||
230 | |||
231 | |||
232 | proc 101 | ||
233 | /----------------\ | ||
234 | / comm: "vi" \ | ||
235 | / fs.root: dentry0 \ | ||
236 | \ fs.cwd: dentry2 / | ||
237 | \ / | ||
238 | \----------------/ | ||
239 | |||
240 | So when vi wants to open("/home/npiggin/test.c", O_RDWR), then it will | ||
241 | start from current->fs->root, which is a pinned dentry. Alternatively, | ||
242 | "./test.c" would start from cwd; both names refer to the same path in | ||
243 | the context of proc101. | ||
244 | |||
245 | dentry 0 | ||
246 | +---------------------+ rcu-walk begins here, we note d_seq, check the | ||
247 | | name: "/" | inode's permission, and then look up the next | ||
248 | | inode: 10 | path element which is "home"... | ||
249 | | children:"home", ...| | ||
250 | +---------------------+ | ||
251 | | | ||
252 | dentry 1 V | ||
253 | +---------------------+ ... which brings us here. We find dentry1 via | ||
254 | | name: "home" | hash lookup, then note d_seq and compare name | ||
255 | | inode: 678 | string and parent pointer. When we have a match, | ||
256 | | children:"npiggin" | we now recheck the d_seq of dentry0. Then we | ||
257 | +---------------------+ check inode and look up the next element. | ||
258 | | | ||
259 | dentry2 V | ||
260 | +---------------------+ Note: if dentry0 is now modified, lookup is | ||
261 | | name: "npiggin" | not necessarily invalid, so we need only keep a | ||
262 | | inode: 543 | parent for d_seq verification, and grandparents | ||
263 | | children:"a.c", ... | can be forgotten. | ||
264 | +---------------------+ | ||
265 | | | ||
266 | dentry3 V | ||
267 | +---------------------+ At this point we have our destination dentry. | ||
268 | | name: "a.c" | We now take its d_lock, verify d_seq of this | ||
269 | | inode: 14221 | dentry. If that checks out, we can increment | ||
270 | | children:NULL | its refcount because we're holding d_lock. | ||
271 | +---------------------+ | ||
272 | |||
273 | Taking a refcount on a dentry from rcu-walk mode, by taking its d_lock, | ||
274 | re-checking its d_seq, and then incrementing its refcount is called | ||
275 | "dropping rcu" or dropping from rcu-walk into ref-walk mode. | ||
276 | |||
277 | It is, in some sense, a bit of a house of cards. If the seqcount check of the | ||
278 | parent snapshot fails, the house comes down, because we had closed the d_seq | ||
279 | section on the grandparent, so we have nothing left to stand on. In that case, | ||
280 | the path walk must be fully restarted (which we do in ref-walk mode, to avoid | ||
281 | live locks). It is costly to have a full restart, but fortunately they are | ||
282 | quite rare. | ||
283 | |||
284 | When we reach a point where sleeping is required, or a filesystem callout | ||
285 | requires ref-walk, then instead of restarting the walk, we attempt to drop rcu | ||
286 | at the last known good dentry we have. Avoiding a full restart in ref-walk in | ||
287 | these cases is fundamental for performance and scalability because blocking | ||
288 | operations such as creates and unlinks are not uncommon. | ||
289 | |||
290 | The detailed design for rcu-walk is like this: | ||
291 | * LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk. | ||
292 | * Take the RCU lock for the entire path walk, starting with the acquiring | ||
293 | of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are | ||
294 | not required for dentry persistence. | ||
295 | * synchronize_rcu is called when unregistering a filesystem, so we can | ||
296 | access d_ops and i_ops during rcu-walk. | ||
297 | * Similarly take the vfsmount lock for the entire path walk. So now mnt | ||
298 | refcounts are not required for persistence. Also we are free to perform mount | ||
299 | lookups, and to assume dentry mount points and mount roots are stable up and | ||
300 | down the path. | ||
301 | * Have a per-dentry seqlock to protect the dentry name, parent, and inode, | ||
302 | so we can load this tuple atomically, and also check whether any of its | ||
303 | members have changed. | ||
304 | * Dentry lookups (based on parent, candidate string tuple) recheck the parent | ||
305 | sequence after the child is found in case anything changed in the parent | ||
306 | during the path walk. | ||
307 | * inode is also RCU protected so we can load d_inode and use the inode for | ||
308 | limited things. | ||
309 | * i_mode, i_uid, i_gid can be tested for exec permissions during path walk. | ||
310 | * i_op can be loaded. | ||
311 | * When the destination dentry is reached, drop rcu there (ie. take d_lock, | ||
312 | verify d_seq, increment refcount). | ||
313 | * If seqlock verification fails anywhere along the path, do a full restart | ||
314 | of the path lookup in ref-walk mode. -ECHILD tends to be used (for want of | ||
315 | a better errno) to signal an rcu-walk failure. | ||
316 | |||
317 | The cases where rcu-walk cannot continue are: | ||
318 | * NULL dentry (ie. any uncached path element) | ||
319 | * parent with d_inode->i_op->permission or ACLs | ||
320 | * dentries with d_revalidate | ||
321 | * Following links | ||
322 | |||
323 | In future patches, permission checks and d_revalidate become rcu-walk aware. It | ||
324 | may be possible eventually to make following links rcu-walk aware. | ||
325 | |||
326 | Uncached path elements will always require dropping to ref-walk mode, at the | ||
327 | very least because i_mutex needs to be grabbed, and objects allocated. | ||
328 | |||
329 | Final note: | ||
330 | "store-free" path walking is not strictly store free. We take vfsmount lock | ||
331 | and refcounts (both of which can be made per-cpu), and we also store to the | ||
332 | stack (which is essentially CPU-local), and we also have to take locks and | ||
333 | refcount on final dentry. | ||
334 | |||
335 | The point is that shared data, where practically possible, is not locked | ||
336 | or stored into. The result is massive improvements in performance and | ||
337 | scalability of path resolution. | ||
338 | |||
339 | |||
340 | Papers and other documentation on dcache locking | ||
341 | ================================================ | ||
342 | |||
343 | 1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). | ||
344 | |||
345 | 2. http://lse.sourceforge.net/locking/dcache/dcache.html | ||