Diffstat (limited to 'Documentation/filesystems/orangefs.txt')
-rw-r--r-- | Documentation/filesystems/orangefs.txt | 79 |
1 files changed, 66 insertions, 13 deletions
diff --git a/Documentation/filesystems/orangefs.txt b/Documentation/filesystems/orangefs.txt
index 925a53e52097..e1a0056a365f 100644
--- a/Documentation/filesystems/orangefs.txt
+++ b/Documentation/filesystems/orangefs.txt
@@ -221,18 +221,71 @@ contains the "downcall" which expresses the results of the request.
 
 The slab allocator is used to keep a cache of op structures handy.
 
-The life cycle of a typical op goes like this:
-
-  - obtain and initialize an op structure from the op_cache.
-
-  - queue the op to the pvfs device so that its upcall data can be
-    read by userspace.
-
-  - wait for userspace to write downcall data back to the pvfs device.
-
-  - consume the downcall and return the op struct to the op_cache.
-
-Some ops are atypical with respect to their payloads: readdir and io ops.
+At init time the kernel module defines and initializes a request list
+and an in_progress hash table to keep track of all the ops that are
+in flight at any given time.
+
+Ops are stateful:
+
+ * unknown  - op was just initialized
+ * waiting  - op is on request_list (upward bound)
+ * inprogr  - op is in progress (waiting for downcall)
+ * serviced - op has matching downcall; ok
+ * purged   - op has to start a timer since client-core
+              exited uncleanly before servicing op
+ * given up - submitter has given up waiting for it
+
+When some arbitrary userspace program needs to perform a
+filesystem operation on Orangefs (readdir, I/O, create, whatever)
+an op structure is initialized and tagged with a distinguishing ID
+number. The upcall part of the op is filled out, and the op is
+passed to the "service_operation" function.
+
+Service_operation changes the op's state to "waiting", puts
+it on the request list, and signals the Orangefs file_operations.poll
+function through a wait queue. Userspace is polling the pseudo-device
+and thus becomes aware of the upcall request that needs to be read.
+
+When the Orangefs file_operations.read function is triggered, the
+request list is searched for an op that seems ready-to-process.
+The op is removed from the request list. The tag from the op and
+the filled-out upcall struct are copy_to_user'ed back to userspace.
+
+If any of these (and some additional protocol) copy_to_users fail,
+the op's state is set to "waiting" and the op is added back to
+the request list. Otherwise, the op's state is changed to "in progress",
+and the op is hashed on its tag and put onto the end of a list in the
+in_progress hash table at the index the tag hashed to.
+
+When userspace has assembled the response to the upcall, it
+writes the response, which includes the distinguishing tag, back to
+the pseudo device in a series of io_vecs. This triggers the Orangefs
+file_operations.write_iter function to find the op with the associated
+tag and remove it from the in_progress hash table. As long as the op's
+state is not "canceled" or "given up", its state is set to "serviced".
+The file_operations.write_iter function returns to the waiting vfs,
+and back to service_operation through wait_for_matching_downcall.
+
+Service operation returns to its caller with the op's downcall
+part (the response to the upcall) filled out.
+
+The "client-core" is the bridge between the kernel module and
+userspace. The client-core is a daemon. The client-core has an
+associated watchdog daemon. If the client-core is ever signaled
+to die, the watchdog daemon restarts the client-core. Even though
+the client-core is restarted "right away", there is a period of
+time during such an event that the client-core is dead. A dead client-core
+can't be triggered by the Orangefs file_operations.poll function.
+Ops that pass through service_operation during a "dead spell" can timeout
+on the wait queue and one attempt is made to recycle them. Obviously,
+if the client-core stays dead too long, the arbitrary userspace processes
+trying to use Orangefs will be negatively affected. Waiting ops
+that can't be serviced will be removed from the request list and
+have their states set to "given up". In-progress ops that can't
+be serviced will be removed from the in_progress hash table and
+have their states set to "given up".
+
+Readdir and I/O ops are atypical with respect to their payloads.
 
   - readdir ops use the smaller of the two pre-allocated pre-partitioned
     memory buffers. The readdir buffer is only available to userspace.
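
Note on the hunk above: the op life cycle it documents (waiting on the
request list, in progress in the hash table, serviced on a matching
downcall) can be sketched as a small userspace model. This is an
illustrative sketch only, not the kernel code. The enum names, the
struct op layout, HASH_SIZE, and the functions submit_op, device_read
and device_write are all hypothetical. Two simplifications are
assumptions as well: a failed read re-queues the op at the tail of the
request list (the text only says it is "added back"), and the read step
simply takes the first op on the list.

```c
#include <stddef.h>

/* Hypothetical names that mirror the states listed in the text. */
enum op_state {
    OP_UNKNOWN,   /* op was just initialized */
    OP_WAITING,   /* op is on the request list */
    OP_INPROGR,   /* upcall was read, waiting for the downcall */
    OP_SERVICED,  /* matching downcall arrived */
    OP_PURGED,    /* not exercised in this sketch */
    OP_GIVEN_UP   /* submitter stopped waiting */
};

#define HASH_SIZE 8  /* assumed table size, for illustration only */

struct op {
    unsigned long tag;   /* distinguishing ID number */
    enum op_state state;
    struct op *next;
};

static struct op *request_list;
static struct op *in_progress[HASH_SIZE];

/* Append op to the end of a singly linked list. */
static void list_append(struct op **head, struct op *op)
{
    op->next = NULL;
    while (*head)
        head = &(*head)->next;
    *head = op;
}

/* Unlink op from a list; returns 1 if it was found. */
static int list_remove(struct op **head, struct op *op)
{
    for (; *head; head = &(*head)->next) {
        if (*head == op) {
            *head = op->next;
            op->next = NULL;
            return 1;
        }
    }
    return 0;
}

/* Models service_operation: mark the op waiting and queue it. */
static void submit_op(struct op *op)
{
    op->state = OP_WAITING;
    list_append(&request_list, op);
}

/* Models the read step. copy_ok stands in for copy_to_user success. */
static struct op *device_read(int copy_ok)
{
    struct op *op = request_list;

    if (!op)
        return NULL;
    list_remove(&request_list, op);
    if (!copy_ok) {
        /* Copy failed: back to "waiting" and back on the request list. */
        op->state = OP_WAITING;
        list_append(&request_list, op);
        return NULL;
    }
    /* Copy succeeded: hash on the tag, append to that bucket's list. */
    op->state = OP_INPROGR;
    list_append(&in_progress[op->tag % HASH_SIZE], op);
    return op;
}

/* Models the write_iter step: match the downcall to an op by tag. */
static struct op *device_write(unsigned long tag)
{
    struct op **head = &in_progress[tag % HASH_SIZE];
    struct op *op;

    for (op = *head; op; op = op->next)
        if (op->tag == tag)
            break;
    if (!op)
        return NULL;
    list_remove(head, op);
    /* The text also names a "canceled" state; only "given up" is modeled. */
    if (op->state != OP_GIVEN_UP)
        op->state = OP_SERVICED;
    return op;
}
```

Calling submit_op, then device_read(1), then device_write with the same
tag walks one op through waiting, in progress and serviced. The purge
and timeout paths that the client-core paragraph describes are left out
of the sketch.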
@@ -311,7 +364,7 @@ particular response.
 jamb everything needed to represent a pvfs2_readdir_response_t into
 the readdir buffer descriptor specified in the upcall.
 
-writev() on /dev/pvfs2-req is used to pass responses to the requests
+Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
 made by the kernel side.
 
 A buffer_list containing: