author Mike Marshall <hubcap@omnibond.com> 2016-02-26 14:39:08 -0500
committer Mike Marshall <hubcap@omnibond.com> 2016-02-26 14:39:08 -0500
commit 9f08cfe94417f782393330cbfc95617c04f051c2
tree 19b70ed52058ffd95a661ba9db466c4c472fae89 /Documentation
parent ca9f518eadeb7edd8e438a6542d3caec9bc3bb74
Orangefs: update orangefs.txt

Al Viro has cleaned up the way ops are processed and waited for,
now orangefs.txt has an overview of how it works. Several recent
related commits have added to the comments in the code as well.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/filesystems/orangefs.txt 79
1 files changed, 66 insertions, 13 deletions
diff --git a/Documentation/filesystems/orangefs.txt b/Documentation/filesystems/orangefs.txt
index 925a53e52097..e1a0056a365f 100644
--- a/Documentation/filesystems/orangefs.txt
+++ b/Documentation/filesystems/orangefs.txt
@@ -221,18 +221,71 @@ contains the "downcall" which expresses the results of the request.
 
 The slab allocator is used to keep a cache of op structures handy.
 
-The life cycle of a typical op goes like this:
-
- - obtain and initialize an op structure from the op_cache.
-
- - queue the op to the pvfs device so that its upcall data can be
-   read by userspace.
-
- - wait for userspace to write downcall data back to the pvfs device.
-
- - consume the downcall and return the op struct to the op_cache.
-
-Some ops are atypical with respect to their payloads: readdir and io ops.
+At init time the kernel module defines and initializes a request list
+and an in_progress hash table to keep track of all the ops that are
+in flight at any given time.
+
+Ops are stateful:
+
+ * unknown - op was just initialized
+ * waiting - op is on request_list (upward bound)
+ * inprogr - op is in progress (waiting for downcall)
+ * serviced - op has matching downcall; ok
+ * purged - op has to start a timer since client-core
+   exited uncleanly before servicing op
+ * given up - submitter has given up waiting for it
+
+When some arbitrary userspace program needs to perform a
+filesystem operation on Orangefs (readdir, I/O, create, whatever)
+an op structure is initialized and tagged with a distinguishing ID
+number. The upcall part of the op is filled out, and the op is
+passed to the "service_operation" function.
+
+Service_operation changes the op's state to "waiting", puts
+it on the request list, and signals the Orangefs file_operations.poll
+function through a wait queue. Userspace is polling the pseudo-device
+and thus becomes aware of the upcall request that needs to be read.
+
+When the Orangefs file_operations.read function is triggered, the
+request list is searched for an op that seems ready-to-process.
+The op is removed from the request list. The tag from the op and
+the filled-out upcall struct are copy_to_user'ed back to userspace.
+
+If any of these (and some additional protocol) copy_to_users fail,
+the op's state is set to "waiting" and the op is added back to
+the request list. Otherwise, the op's state is changed to "in progress",
+and the op is hashed on its tag and put onto the end of a list in the
+in_progress hash table at the index the tag hashed to.
+
+When userspace has assembled the response to the upcall, it
+writes the response, which includes the distinguishing tag, back to
+the pseudo device in a series of io_vecs. This triggers the Orangefs
+file_operations.write_iter function to find the op with the associated
+tag and remove it from the in_progress hash table. As long as the op's
+state is not "canceled" or "given up", its state is set to "serviced".
+The file_operations.write_iter function returns to the waiting vfs,
+and back to service_operation through wait_for_matching_downcall.
+
+Service operation returns to its caller with the op's downcall
+part (the response to the upcall) filled out.
+
+The "client-core" is the bridge between the kernel module and
+userspace. The client-core is a daemon. The client-core has an
+associated watchdog daemon. If the client-core is ever signaled
+to die, the watchdog daemon restarts the client-core. Even though
+the client-core is restarted "right away", there is a period of
+time during such an event that the client-core is dead. A dead client-core
+can't be triggered by the Orangefs file_operations.poll function.
+Ops that pass through service_operation during a "dead spell" can timeout
+on the wait queue and one attempt is made to recycle them. Obviously,
+if the client-core stays dead too long, the arbitrary userspace processes
+trying to use Orangefs will be negatively affected. Waiting ops
+that can't be serviced will be removed from the request list and
+have their states set to "given up". In-progress ops that can't
+be serviced will be removed from the in_progress hash table and
+have their states set to "given up".
+
+Readdir and I/O ops are atypical with respect to their payloads.
 
  - readdir ops use the smaller of the two pre-allocated pre-partitioned
    memory buffers. The readdir buffer is only available to userspace.
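
The op life cycle that the added text describes can be read as a small
state machine. The following is a minimal sketch in C under stated
assumptions, not the kernel's implementation: the enum identifiers, the
struct layout, and the helper names (op_queue, op_read, op_downcall,
op_give_up) are hypothetical and chosen only to mirror the states and
steps named in the hunk above. The request list, the in_progress hash
table, locking, and the wait queue are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the states listed in orangefs.txt. */
enum op_state {
	OP_UNKNOWN,	/* op was just initialized */
	OP_WAITING,	/* op is on the request list (upward bound) */
	OP_INPROGR,	/* upcall was read, waiting for the downcall */
	OP_SERVICED,	/* op has a matching downcall */
	OP_PURGED,	/* client-core exited before servicing the op */
	OP_GIVEN_UP	/* submitter has given up waiting */
};

/* Simplified stand-in for an op: only the tag and the state. */
struct op {
	unsigned long tag;
	enum op_state state;
};

/* service_operation step: a freshly initialized op becomes "waiting". */
bool op_queue(struct op *op)
{
	assert(op != NULL);
	if (op->state != OP_UNKNOWN)
		return false;
	op->state = OP_WAITING;
	return true;
}

/*
 * read step: if the copy to userspace succeeds the op becomes
 * "in progress"; if it fails the op stays "waiting", which models
 * putting it back on the request list.
 */
bool op_read(struct op *op, bool copy_ok)
{
	assert(op != NULL);
	if (op->state != OP_WAITING || !copy_ok)
		return false;
	op->state = OP_INPROGR;
	return true;
}

/*
 * write_iter step: a downcall carrying the op's tag marks an
 * in-progress op "serviced". An op that was given up in the meantime
 * is left alone (assumption: only in-progress ops are matched).
 */
bool op_downcall(struct op *op, unsigned long tag)
{
	assert(op != NULL);
	if (op->tag != tag || op->state != OP_INPROGR)
		return false;
	op->state = OP_SERVICED;
	return true;
}

/* Timeout path: waiting or in-progress ops end up "given up". */
void op_give_up(struct op *op)
{
	assert(op != NULL);
	if (op->state == OP_WAITING || op->state == OP_INPROGR)
		op->state = OP_GIVEN_UP;
}
```

In this sketch a normal request walks unknown, waiting, inprogr,
serviced, while a request issued during a client-core "dead spell"
ends in given up instead.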
@@ -311,7 +364,7 @@ particular response.
    jamb everything needed to represent a pvfs2_readdir_response_t into
    the readdir buffer descriptor specified in the upcall.
 
-writev() on /dev/pvfs2-req is used to pass responses to the requests
+Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
 made by the kernel side.
 
 A buffer_list containing: