author | Mike Marshall <hubcap@omnibond.com> | 2016-08-01 14:01:40 -0400 |
---|---|---|
committer | Martin Brandenburg <martin@omnibond.com> | 2016-08-02 15:39:14 -0400 |
commit | 302f0493f0bfaabd6f77ce7bfaa12620abf74948 (patch) | |
tree | ce1998a12a23deefe4e52b43acdaac7ecc9969e6 | |
parent | 8bbb20a863ca72dfb9025a4653f21b5abf926d20 (diff) |
Orangefs: update orangefs.txt
Describe use of jiffy-based timeout values involved in inode maintenance.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
-rw-r--r-- | Documentation/filesystems/orangefs.txt | 50 |
1 files changed, 46 insertions, 4 deletions
diff --git a/Documentation/filesystems/orangefs.txt b/Documentation/filesystems/orangefs.txt
index e1a0056a365f..1dfdec790946 100644
--- a/Documentation/filesystems/orangefs.txt
+++ b/Documentation/filesystems/orangefs.txt
@@ -281,7 +281,7 @@ on the wait queue and one attempt is made to recycle them. Obviously, | |||
281 | if the client-core stays dead too long, the arbitrary userspace processes | 281 | if the client-core stays dead too long, the arbitrary userspace processes |
282 | trying to use Orangefs will be negatively affected. Waiting ops | 282 | trying to use Orangefs will be negatively affected. Waiting ops |
283 | that can't be serviced will be removed from the request list and | 283 | that can't be serviced will be removed from the request list and |
284 | have their states set to "given up". In-progress ops that can't | 284 | have their states set to "given up". In-progress ops that can't |
285 | be serviced will be removed from the in_progress hash table and | 285 | be serviced will be removed from the in_progress hash table and |
286 | have their states set to "given up". | 286 | have their states set to "given up". |
287 | 287 | ||
@@ -338,7 +338,7 @@ particular response. | |||
338 | PVFS2_VFS_OP_STATFS | 338 | PVFS2_VFS_OP_STATFS |
339 | fill a pvfs2_statfs_response_t with useless info <g>. It is hard for | 339 | fill a pvfs2_statfs_response_t with useless info <g>. It is hard for |
340 | us to know, in a timely fashion, these statistics about our | 340 | us to know, in a timely fashion, these statistics about our |
341 | distributed network filesystem. | 341 | distributed network filesystem. |
342 | 342 | ||
343 | PVFS2_VFS_OP_FS_MOUNT | 343 | PVFS2_VFS_OP_FS_MOUNT |
344 | fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref | 344 | fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref |
@@ -386,7 +386,7 @@ responses: | |||
386 | 386 | ||
387 | io_array[1].iov_base = address of global variable "pdev_magic" (int32_t) | 387 | io_array[1].iov_base = address of global variable "pdev_magic" (int32_t) |
388 | io_array[1].iov_len = sizeof(int32_t) | 388 | io_array[1].iov_len = sizeof(int32_t) |
389 | 389 | ||
390 | io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t) | 390 | io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t) |
391 | io_array[2].iov_len = sizeof(int64_t) | 391 | io_array[2].iov_len = sizeof(int64_t) |
392 | 392 | ||
@@ -402,5 +402,47 @@ Readdir responses initialize the fifth element io_array like this: | |||
402 | io_array[4].iov_len = contents of member trailer_size (PVFS_size) | 402 | io_array[4].iov_len = contents of member trailer_size (PVFS_size) |
403 | from out_downcall member of global variable | 403 | from out_downcall member of global variable |
404 | vfs_request | 404 | vfs_request |
405 | 405 | ||
406 | Orangefs exploits the dcache in order to avoid sending redundant | ||
407 | requests to userspace. We keep object inode attributes up-to-date with | ||
408 | orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to | ||
409 | help it decide whether or not to update an inode: "new" and "bypass". | ||
410 | Orangefs keeps private data in an object's inode that includes a short | ||
411 | timeout value, getattr_time, which allows any iteration of | ||
412 | orangefs_inode_getattr to know how long it has been since the inode was | ||
413 | updated. When the object is not new (new == 0) and the bypass flag is not | ||
414 | set (bypass == 0) orangefs_inode_getattr returns without updating the inode | ||
415 | if getattr_time has not timed out. Getattr_time is updated each time the | ||
416 | inode is updated. | ||
417 | |||
418 | Creation of a new object (file, dir, sym-link) includes the evaluation of | ||
419 | its pathname, resulting in a negative directory entry for the object. | ||
420 | A new inode is allocated and associated with the dentry, turning it from | ||
421 | a negative dentry into a "productive full member of society". Orangefs | ||
422 | obtains the new inode from Linux with new_inode() and associates | ||
423 | the inode with the dentry by sending the pair back to Linux with | ||
424 | d_instantiate(). | ||
425 | |||
426 | The evaluation of a pathname for an object resolves to its corresponding | ||
427 | dentry. If there is no corresponding dentry, one is created for it in | ||
428 | the dcache. Whenever a dentry is modified or verified Orangefs stores a | ||
429 | short timeout value in the dentry's d_time, and the dentry will be trusted | ||
430 | for that amount of time. Orangefs is a network filesystem, and objects | ||
431 | can potentially change out-of-band with any particular Orangefs kernel module | ||
432 | instance, so trusting a dentry is risky. The alternative to trusting | ||
433 | dentries is to always obtain the needed information from userspace - at | ||
434 | least a trip to the client-core, maybe to the servers. Obtaining information | ||
435 | from a dentry is cheap, obtaining it from userspace is relatively expensive, | ||
436 | hence the motivation to use the dentry when possible. | ||
437 | |||
438 | The timeout values d_time and getattr_time are jiffy based, and the | ||
439 | code is designed to avoid the jiffy-wrap problem: | ||
440 | |||
441 | "In general, if the clock may have wrapped around more than once, there | ||
442 | is no way to tell how much time has elapsed. However, if the times t1 | ||
443 | and t2 are known to be fairly close, we can reliably compute the | ||
444 | difference in a way that takes into account the possibility that the | ||
445 | clock may have wrapped between times." | ||
446 | |||
447 | from course notes by instructor Andy Wang | ||
406 | 448 | ||
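
The wrap-safe comparison described in the added text can be sketched in userspace C. This is an illustration only, not the Orangefs code: ticks_t, ticks_before(), expiry_after() and needs_refresh() are hypothetical names standing in for jiffies, the kernel's time_before() style of comparison, and the getattr_time check. The sketch assumes that a set "new" or "bypass" flag always leads to an update, and that the two times being compared are less than half the counter range apart.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the type of the kernel's jiffies counter. */
typedef unsigned long ticks_t;

/*
 * True if a is earlier than b. The unsigned subtraction wraps, and the
 * signed reinterpretation of the difference gives the right answer as
 * long as a and b are less than half the counter range apart. This
 * relies on two's complement conversion, as on gcc and clang.
 */
static bool ticks_before(ticks_t a, ticks_t b)
{
	return (long)(a - b) < 0;
}

/* Compute an expiry time; unsigned addition may wrap, which is fine. */
static ticks_t expiry_after(ticks_t now, ticks_t ttl)
{
	assert(ttl <= (ticks_t)LONG_MAX); /* keep times "fairly close" */
	return now + ttl;
}

/*
 * Hypothetical version of the decision described for
 * orangefs_inode_getattr: skip the update only when the object is not
 * new, bypass is not set, and the timeout has not yet expired.
 */
static bool needs_refresh(int is_new, int bypass, ticks_t now, ticks_t expiry)
{
	if (is_new || bypass)
		return true; /* assumption: these flags force an update */
	return !ticks_before(now, expiry);
}
```

A plain "now < expiry" test would fail when the counter wraps between setting the timeout and checking it, because the expiry value would then be numerically smaller than the current time even though it lies in the future.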