Diffstat (limited to 'Documentation/filesystems/relayfs.txt')
-rw-r--r-- | Documentation/filesystems/relayfs.txt | 362 |
1 files changed, 362 insertions, 0 deletions
diff --git a/Documentation/filesystems/relayfs.txt b/Documentation/filesystems/relayfs.txt
new file mode 100644
index 000000000000..d24e1b0d4f39
--- /dev/null
+++ b/Documentation/filesystems/relayfs.txt
@@ -0,0 +1,362 @@ | |||
1 | |||
2 | relayfs - a high-speed data relay filesystem | ||
3 | ============================================ | ||
4 | |||
5 | relayfs is a filesystem designed to provide an efficient mechanism for | ||
6 | tools and facilities to relay large and potentially sustained streams | ||
7 | of data from kernel space to user space. | ||
8 | |||
9 | The main abstraction of relayfs is the 'channel'. A channel consists | ||
10 | of a set of per-cpu kernel buffers each represented by a file in the | ||
11 | relayfs filesystem. Kernel clients write into a channel using | ||
12 | efficient write functions which automatically log to the current cpu's | ||
13 | channel buffer. User space applications mmap() the per-cpu files and | ||
14 | retrieve the data as it becomes available. | ||
15 | |||
16 | The format of the data logged into the channel buffers is completely | ||
17 | up to the relayfs client; relayfs does, however, provide hooks which | ||
18 | allow clients to impose some structure on the buffer data. relayfs | ||
19 | does not implement any form of data filtering either - this too is | ||
20 | left to the client. The purpose is to keep relayfs as simple as possible. | ||
21 | |||
22 | This document provides an overview of the relayfs API. The details of | ||
23 | the function parameters are documented along with the functions in the | ||
24 | filesystem code - please see that for details. | ||
25 | |||
26 | Semantics | ||
27 | ========= | ||
28 | |||
29 | Each relayfs channel has one buffer per CPU; each buffer has one or | ||
30 | more sub-buffers. Messages are written to the first sub-buffer until | ||
31 | it is too full to contain a new message, in which case the message is | ||
32 | written to the next sub-buffer (if available). Messages are never split | ||
33 | across sub-buffers. At this point, userspace can be notified so that it | ||
34 | empties the first sub-buffer while the kernel continues writing to the next. | ||
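
The switching rule can be modelled in ordinary user-space C. The sketch
below is not relayfs code: struct model_buf, model_write() and the tiny
sizes are made up for illustration. It only shows that a message which
does not fit is moved whole to the next sub-buffer, and that the unused
tail of the old sub-buffer is remembered as padding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_SUBBUFS   2   /* illustrative values, not relayfs defaults */
#define SUBBUF_SIZE 8

struct model_buf {
	char   data[N_SUBBUFS * SUBBUF_SIZE];
	size_t cur;                 /* index of the sub-buffer being written */
	size_t offset;              /* bytes used in the current sub-buffer */
	size_t padding[N_SUBBUFS];  /* unused bytes left when a sub-buffer closed */
};

/* Returns 0 on success, -1 if the message cannot be stored. */
static int model_write(struct model_buf *b, const void *msg, size_t len)
{
	assert(b != NULL && msg != NULL);

	if (len > SUBBUF_SIZE)
		return -1;          /* can never fit into one sub-buffer */

	if (b->offset + len > SUBBUF_SIZE) {
		if (b->cur + 1 == N_SUBBUFS)
			return -1;  /* this model has no next sub-buffer left */
		b->padding[b->cur] = SUBBUF_SIZE - b->offset;
		b->cur++;
		b->offset = 0;
	}
	/* The message always lands in a single sub-buffer, never split. */
	memcpy(b->data + b->cur * SUBBUF_SIZE + b->offset, msg, len);
	b->offset += len;
	return 0;
}
```

Unlike a real channel buffer, this model does not wrap around; it stops
once the last sub-buffer is full.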
35 | |||
36 | When userspace is notified that a sub-buffer is full, the kernel knows | ||
37 | how many bytes of it are padding, i.e. unused. Userspace can use this | ||
38 | knowledge to copy only valid data. | ||
39 | |||
40 | After copying it, userspace can notify the kernel that a sub-buffer | ||
41 | has been consumed. | ||
42 | |||
43 | relayfs can operate in a mode where it overwrites data not yet | ||
44 | collected by userspace instead of waiting for it to be consumed. | ||
45 | |||
46 | relayfs itself does not provide for communication of such data between | ||
47 | userspace and kernel, allowing the kernel side to remain simple and not | ||
48 | impose a single interface on userspace. It does provide a separate | ||
49 | helper though, described below. | ||
50 | |||
51 | klog, relay-app & librelay | ||
52 | ========================== | ||
53 | |||
54 | relayfs itself is ready to use, but to make things easier, two | ||
55 | additional systems are provided. klog is a simple wrapper to make | ||
56 | writing formatted text or raw data to a channel simpler, regardless of | ||
57 | whether a channel to write into exists or not, or whether relayfs is | ||
58 | compiled into the kernel or is configured as a module. relay-app is | ||
59 | the kernel counterpart of userspace librelay.c; combined, these two | ||
60 | files provide glue to easily stream data to disk, without having to | ||
61 | bother with housekeeping. klog and relay-app can be used together, | ||
62 | with klog providing high-level logging functions to the kernel and | ||
63 | relay-app taking care of kernel-user control and disk-logging chores. | ||
64 | |||
65 | It is possible to use relayfs without relay-app & librelay, but you'll | ||
66 | have to implement communication between userspace and kernel, allowing | ||
67 | both to convey the state of buffers (full, empty, amount of padding). | ||
68 | |||
69 | klog, relay-app and librelay can be found in the relay-apps tarball on | ||
70 | http://relayfs.sourceforge.net | ||
71 | |||
72 | The relayfs user space API | ||
73 | ========================== | ||
74 | |||
75 | relayfs implements basic file operations for user space access to | ||
76 | relayfs channel buffer data. Here are the file operations that are | ||
77 | available and some comments regarding their behavior: | ||
78 | |||
79 | open()   enables user to open an _existing_ buffer. | ||
80 | |||
81 | mmap()   results in channel buffer being mapped into the caller's | ||
82 |          memory space. Note that you can't do a partial mmap - you must | ||
83 |          map the entire file, which is NRBUF * SUBBUFSIZE. | ||
84 | |||
85 | read()   read the contents of a channel buffer. The bytes read are | ||
86 |          'consumed' by the reader i.e. they won't be available again | ||
87 |          to subsequent reads. If the channel is being used in | ||
88 |          no-overwrite mode (the default), it can be read at any time | ||
89 |          even if there's an active kernel writer. If the channel is | ||
90 |          being used in overwrite mode and there are active channel | ||
91 |          writers, results may be unpredictable - users should make | ||
92 |          sure that all logging to the channel has ended before using | ||
93 |          read() with overwrite mode. | ||
94 | |||
95 | poll()   POLLIN/POLLRDNORM/POLLERR supported. User applications are | ||
96 |          notified when sub-buffer boundaries are crossed. | ||
97 | |||
98 | close()  decrements the channel buffer's refcount. When the refcount | ||
99 |          reaches 0 i.e. when no process or kernel client has the buffer | ||
100 |          open, the channel buffer is freed. | ||
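
As a rough illustration of the open() and mmap() rules above, the helper
below opens one existing per-cpu file and maps all of it. It is a sketch
under assumptions: the function names, the read-only MAP_PRIVATE mapping
and the idea that the caller already knows the sub-buffer count and size
(for example because the kernel client documents them) are not part of
relayfs itself.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Size of one per-cpu file; partial mappings are not allowed. */
static size_t relay_map_len(size_t n_subbufs, size_t subbuf_size)
{
	assert(n_subbufs > 0 && subbuf_size > 0);
	return n_subbufs * subbuf_size;
}

/*
 * Open an existing per-cpu relayfs file and map the whole buffer
 * read-only. Returns the mapping, or NULL if open() or mmap() fails.
 */
static void *relay_map(const char *path, size_t n_subbufs, size_t subbuf_size)
{
	size_t len = relay_map_len(n_subbufs, subbuf_size);
	void *p;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return NULL;        /* the buffer file must already exist */
	p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);                  /* a mapping outlives its descriptor */
	return p == MAP_FAILED ? NULL : p;
}
```

A caller that also wants to poll() for sub-buffer boundaries would keep
the descriptor open instead of closing it here.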
101 | |||
102 | |||
103 | In order for a user application to make use of relayfs files, the | ||
104 | relayfs filesystem must be mounted. For example, | ||
105 | |||
106 | mount -t relayfs relayfs /mnt/relay | ||
107 | |||
108 | NOTE: relayfs doesn't need to be mounted for kernel clients to create | ||
109 | or use channels - it only needs to be mounted when user space | ||
110 | applications need access to the buffer data. | ||
111 | |||
112 | |||
113 | The relayfs kernel API | ||
114 | ====================== | ||
115 | |||
116 | Here's a summary of the API relayfs provides to in-kernel clients: | ||
117 | |||
118 | |||
119 |   channel management functions: | ||
120 | |||
121 |     relay_open(base_filename, parent, subbuf_size, n_subbufs, | ||
122 |                callbacks) | ||
123 |     relay_close(chan) | ||
124 |     relay_flush(chan) | ||
125 |     relay_reset(chan) | ||
126 |     relayfs_create_dir(name, parent) | ||
127 |     relayfs_remove_dir(dentry) | ||
128 | |||
129 |   channel management typically called on instigation of userspace: | ||
130 | |||
131 |     relay_subbufs_consumed(chan, cpu, subbufs_consumed) | ||
132 | |||
133 |   write functions: | ||
134 | |||
135 |     relay_write(chan, data, length) | ||
136 |     __relay_write(chan, data, length) | ||
137 |     relay_reserve(chan, length) | ||
138 | |||
139 |   callbacks: | ||
140 | |||
141 |     subbuf_start(buf, subbuf, prev_subbuf, prev_padding) | ||
142 |     buf_mapped(buf, filp) | ||
143 |     buf_unmapped(buf, filp) | ||
144 | |||
145 |   helper functions: | ||
146 | |||
147 |     relay_buf_full(buf) | ||
148 |     subbuf_start_reserve(buf, length) | ||
149 | |||
150 | |||
151 | Creating a channel | ||
152 | ------------------ | ||
153 | |||
154 | relay_open() is used to create a channel, along with its per-cpu | ||
155 | channel buffers. Each channel buffer will have an associated file | ||
156 | created for it in the relayfs filesystem, which can be opened and | ||
157 | mmapped from user space if desired. The files are named | ||
158 | basename0...basenameN-1 where N is the number of online cpus, and by | ||
159 | default will be created in the root of the filesystem. If you want a | ||
160 | directory structure to contain your relayfs files, you can create it | ||
161 | with relayfs_create_dir() and pass the parent directory to | ||
162 | relay_open(). Clients are responsible for cleaning up any directory | ||
163 | structure they create when the channel is closed - use | ||
164 | relayfs_remove_dir() for that. | ||
165 | |||
166 | The total size of each per-cpu buffer is calculated by multiplying the | ||
167 | number of sub-buffers by the sub-buffer size passed into relay_open(). | ||
168 | The idea behind sub-buffers is that they're basically an extension of | ||
169 | double-buffering to N buffers, and they also allow applications to | ||
170 | easily implement random-access-on-buffer-boundary schemes, which can | ||
171 | be important for some high-volume applications. The number and size | ||
172 | of sub-buffers is completely dependent on the application and even for | ||
173 | the same application, different conditions will warrant different | ||
174 | values for these parameters at different times. Typically, the right | ||
175 | values to use are best decided after some experimentation; in general, | ||
176 | though, it's safe to assume that having only 1 sub-buffer is a bad | ||
177 | idea - you're guaranteed to either overwrite data or lose events | ||
178 | depending on the channel mode being used. | ||
179 | |||
180 | Channel 'modes' | ||
181 | --------------- | ||
182 | |||
183 | relayfs channels can be used in either of two modes - 'overwrite' or | ||
184 | 'no-overwrite'. The mode is entirely determined by the implementation | ||
185 | of the subbuf_start() callback, as described below. In 'overwrite' | ||
186 | mode, also known as 'flight recorder' mode, writes continuously cycle | ||
187 | around the buffer and will never fail, but will unconditionally | ||
188 | overwrite old data regardless of whether it's actually been consumed. | ||
189 | In no-overwrite mode, writes will fail i.e. data will be lost, if the | ||
190 | number of unconsumed sub-buffers equals the total number of | ||
191 | sub-buffers in the channel. It should be clear that if there is no | ||
192 | consumer or if the consumer can't consume sub-buffers fast enough, | ||
193 | data will be lost in either case; the only difference is whether data | ||
194 | is lost from the beginning or the end of a buffer. | ||
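
The difference between the two modes can be put as a small calculation.
The following user-space sketch is only a model (the names are invented
here, and it ignores partially filled sub-buffers): given how many
sub-buffers' worth of data the client tried to log while nothing was
consumed, it returns which of them are still in the buffer.

```c
#include <assert.h>
#include <stddef.h>

struct range {
	size_t first;   /* first surviving sub-buffer sequence number */
	size_t end;     /* one past the last surviving one */
};

/* Sub-buffers still holding data after `written` fills, none consumed. */
static struct range survivors(size_t written, size_t n_subbufs, int overwrite)
{
	struct range r;

	assert(n_subbufs > 0);
	if (written <= n_subbufs) {
		r.first = 0;                    /* nothing lost yet */
		r.end = written;
	} else if (overwrite) {
		r.first = written - n_subbufs;  /* oldest data overwritten */
		r.end = written;
	} else {
		r.first = 0;                    /* newest data rejected */
		r.end = n_subbufs;
	}
	return r;
}
```

With 4 sub-buffers and 6 fills, overwrite mode keeps fills 2 to 5 while
no-overwrite mode keeps fills 0 to 3.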
195 | |||
196 | As explained above, a relayfs channel is made up of one or more | ||
197 | per-cpu channel buffers, each implemented as a circular buffer | ||
198 | subdivided into one or more sub-buffers. Messages are written into | ||
199 | the current sub-buffer of the channel's current per-cpu buffer via the | ||
200 | write functions described below. Whenever a message can't fit into | ||
201 | the current sub-buffer, because there's no room left for it, the | ||
202 | client is notified via the subbuf_start() callback that a switch to a | ||
203 | new sub-buffer is about to occur. The client uses this callback to 1) | ||
204 | initialize the next sub-buffer if appropriate 2) finalize the previous | ||
205 | sub-buffer if appropriate and 3) return a boolean value indicating | ||
206 | whether or not to actually go ahead with the sub-buffer switch. | ||
207 | |||
208 | To implement 'no-overwrite' mode, the kernel client would provide | ||
209 | an implementation of the subbuf_start() callback something like the | ||
210 | following: | ||
211 | |||
212 |     static int subbuf_start(struct rchan_buf *buf, | ||
213 |                             void *subbuf, | ||
214 |                             void *prev_subbuf, | ||
215 |                             unsigned int prev_padding) | ||
216 |     { | ||
217 |             if (prev_subbuf) | ||
218 |                     *((unsigned *)prev_subbuf) = prev_padding; | ||
219 | |||
220 |             if (relay_buf_full(buf)) | ||
221 |                     return 0; | ||
222 | |||
223 |             subbuf_start_reserve(buf, sizeof(unsigned int)); | ||
224 | |||
225 |             return 1; | ||
226 |     } | ||
227 | |||
228 | If the current buffer is full i.e. all sub-buffers remain unconsumed, | ||
229 | the callback returns 0 to indicate that the buffer switch should not | ||
230 | occur yet i.e. until the consumer has had a chance to read the current | ||
231 | set of ready sub-buffers. For the relay_buf_full() function to make | ||
232 | sense, the consumer is responsible for notifying relayfs when | ||
233 | sub-buffers have been consumed via relay_subbufs_consumed(). Any | ||
234 | subsequent attempts to write into the buffer will again invoke the | ||
235 | subbuf_start() callback with the same parameters; only when the | ||
236 | consumer has consumed one or more of the ready sub-buffers will | ||
237 | relay_buf_full() return 0, in which case the buffer switch can | ||
238 | continue. | ||
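
The interplay between the full check and consumer notification can be
modelled with two counters. This is a user-space sketch, not the relayfs
implementation; the struct and function names are hypothetical stand-ins
for relay_buf_full(), the no-overwrite subbuf_start() decision and
relay_subbufs_consumed().

```c
#include <assert.h>
#include <stddef.h>

struct model_counts {
	size_t n_subbufs;
	size_t produced;    /* sub-buffers filled by the writer */
	size_t consumed;    /* sub-buffers the consumer reported as read */
};

/* Full when every sub-buffer holds data that has not been consumed. */
static int model_buf_full(const struct model_counts *c)
{
	return c->produced - c->consumed >= c->n_subbufs;
}

/* The writer has just filled the current sub-buffer. */
static void model_subbuf_filled(struct model_counts *c)
{
	c->produced++;
}

/* No-overwrite decision: 1 allows the switch, 0 refuses it. */
static int model_subbuf_start(const struct model_counts *c)
{
	return !model_buf_full(c);
}

/* The consumer reports that it has finished with n sub-buffers. */
static void model_subbufs_consumed(struct model_counts *c, size_t n)
{
	assert(c->consumed + n <= c->produced);
	c->consumed += n;
}
```

In this model a refused switch keeps being refused on every retry until
model_subbufs_consumed() is called, which mirrors the behaviour described
above.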
239 | |||
240 | The implementation of the subbuf_start() callback for 'overwrite' mode | ||
241 | would be very similar: | ||
242 | |||
243 |     static int subbuf_start(struct rchan_buf *buf, | ||
244 |                             void *subbuf, | ||
245 |                             void *prev_subbuf, | ||
246 |                             unsigned int prev_padding) | ||
247 |     { | ||
248 |             if (prev_subbuf) | ||
249 |                     *((unsigned *)prev_subbuf) = prev_padding; | ||
250 | |||
251 |             subbuf_start_reserve(buf, sizeof(unsigned int)); | ||
252 | |||
253 |             return 1; | ||
254 |     } | ||
255 | |||
256 | In this case, the relay_buf_full() check is meaningless and the | ||
257 | callback always returns 1, causing the buffer switch to occur | ||
258 | unconditionally. It's also meaningless for the client to use the | ||
259 | relay_subbufs_consumed() function in this mode, as it's never | ||
260 | consulted. | ||
261 | |||
262 | The default subbuf_start() implementation, used if the client doesn't | ||
263 | define any callbacks, or doesn't define the subbuf_start() callback, | ||
264 | implements the simplest possible 'no-overwrite' mode i.e. it does | ||
265 | nothing but return 0. | ||
266 | |||
267 | Header information can be reserved at the beginning of each sub-buffer | ||
268 | by calling the subbuf_start_reserve() helper function from within the | ||
269 | subbuf_start() callback. This reserved area can be used to store | ||
270 | whatever information the client wants. In the example above, room is | ||
271 | reserved in each sub-buffer to store the padding count for that | ||
272 | sub-buffer. This is filled in for the previous sub-buffer in the | ||
273 | subbuf_start() implementation; the padding value for the previous | ||
274 | sub-buffer is passed into the subbuf_start() callback along with a | ||
275 | pointer to the previous sub-buffer, since the padding value isn't | ||
276 | known until a sub-buffer is filled. The subbuf_start() callback is | ||
277 | also called for the first sub-buffer when the channel is opened, to | ||
278 | give the client a chance to reserve space in it. In this case the | ||
279 | previous sub-buffer pointer passed into the callback will be NULL, so | ||
280 | the client should check the value of the prev_subbuf pointer before | ||
281 | writing into the previous sub-buffer. | ||
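
On the consumer side, a header written this way lets user space find the
valid bytes of a full sub-buffer. The helper below is a sketch that
assumes the layout of the example callbacks (an unsigned int padding
count, then the messages, then the padding) and assumes the padding count
does not include the reserved header itself; subbuf_payload() is a name
invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Locate the valid payload of one full sub-buffer.
 * Returns the payload length and stores its start in *payload.
 */
static size_t subbuf_payload(const void *subbuf, size_t subbuf_size,
			     const void **payload)
{
	unsigned int padding;

	assert(subbuf != NULL && payload != NULL);
	assert(subbuf_size >= sizeof(padding));

	/* memcpy avoids any assumption about the buffer's alignment. */
	memcpy(&padding, subbuf, sizeof(padding));
	assert(padding <= subbuf_size - sizeof(padding));

	*payload = (const char *)subbuf + sizeof(padding);
	return subbuf_size - sizeof(padding) - padding;
}
```

A real reader would reject an out-of-range padding value gracefully
rather than assert, since the buffer contents come from another component.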
282 | |||
283 | Writing to a channel | ||
284 | -------------------- | ||
285 | |||
286 | Kernel clients write data into the current cpu's channel buffer using | ||
287 | relay_write() or __relay_write(). relay_write() is the main logging | ||
288 | function - it uses local_irq_save() to protect the buffer and should be | ||
289 | used if you might be logging from interrupt context. If you know | ||
290 | you'll never be logging from interrupt context, you can use | ||
291 | __relay_write(), which only disables preemption. These functions | ||
292 | don't return a value, so you can't determine whether or not they | ||
293 | failed - the assumption is that you wouldn't want to check a return | ||
294 | value in the fast logging path anyway, and that they'll always succeed | ||
295 | unless the buffer is full and no-overwrite mode is being used, in | ||
296 | which case you can detect a failed write in the subbuf_start() | ||
297 | callback by calling the relay_buf_full() helper function. | ||
298 | |||
299 | relay_reserve() is used to reserve a slot in a channel buffer which | ||
300 | can be written to later. This would typically be used in applications | ||
301 | that need to write directly into a channel buffer without having to | ||
302 | stage data in a temporary buffer beforehand. Because the actual write | ||
303 | may not happen immediately after the slot is reserved, applications | ||
304 | using relay_reserve() can keep a count of the number of bytes actually | ||
305 | written, either in space reserved in the sub-buffers themselves or as | ||
306 | a separate array. See the 'reserve' example in the relay-apps tarball | ||
307 | at http://relayfs.sourceforge.net for an example of how this can be | ||
308 | done. Because the write is under control of the client and is | ||
309 | separated from the reserve, relay_reserve() doesn't protect the buffer | ||
310 | at all - it's up to the client to provide the appropriate | ||
311 | synchronization when using relay_reserve(). | ||
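
One way to keep the byte count mentioned above is to track reserved and
committed bytes separately. The following is a single-threaded user-space
model with invented names, not the relayfs implementation and not the
'reserve' example from the tarball; it returns NULL where a real channel
would switch sub-buffers, and it leaves out the synchronization a real
client must provide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 16   /* illustrative value */

struct model_subbuf {
	char   data[SUBBUF_SIZE];
	size_t reserved;    /* bytes handed out by model_reserve() */
	size_t committed;   /* bytes the client has actually filled in */
};

/* Hand out a slot without copying anything; NULL if it does not fit. */
static void *model_reserve(struct model_subbuf *sb, size_t len)
{
	void *slot;

	if (len > SUBBUF_SIZE - sb->reserved)
		return NULL;
	slot = sb->data + sb->reserved;
	sb->reserved += len;
	return slot;
}

/* Called by the client once it has finished writing a reserved slot. */
static void model_commit(struct model_subbuf *sb, size_t len)
{
	assert(sb->committed + len <= sb->reserved);
	sb->committed += len;
}
```

A reader of this model would treat the sub-buffer as complete only when
committed equals reserved.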
312 | |||
313 | Closing a channel | ||
314 | ----------------- | ||
315 | |||
316 | The client calls relay_close() when it's finished using the channel. | ||
317 | The channel and its associated buffers are destroyed when there are no | ||
318 | longer any references to any of the channel buffers. relay_flush() | ||
319 | forces a sub-buffer switch on all the channel buffers, and can be used | ||
320 | to finalize and process the last sub-buffers before the channel is | ||
321 | closed. | ||
322 | |||
323 | Misc | ||
324 | ---- | ||
325 | |||
326 | Some applications may want to keep a channel around and re-use it | ||
327 | rather than open and close a new channel for each use. relay_reset() | ||
328 | can be used for this purpose - it resets a channel to its initial | ||
329 | state without reallocating channel buffer memory or destroying | ||
330 | existing mappings. It should however only be called when it's safe to | ||
331 | do so i.e. when the channel isn't currently being written to. | ||
332 | |||
333 | Finally, there are a couple of utility callbacks that can be used for | ||
334 | different purposes. buf_mapped() is called whenever a channel buffer | ||
335 | is mmapped from user space and buf_unmapped() is called when it's | ||
336 | unmapped. The client can use this notification to trigger actions | ||
337 | within the kernel application, such as enabling/disabling logging to | ||
338 | the channel. | ||
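
As an illustration of that last point, a client could count mappings and
log only while at least one exists. This is a user-space model: the real
callbacks receive (buf, filp) arguments, and kernel code would need
proper atomic or locked counters, both of which are left out here. All
names are hypothetical.

```c
#include <assert.h>

static int map_count;         /* number of current user-space mappings */
static int logging_enabled;   /* checked by the client before it logs */

/* Stand-in for the client's buf_mapped() callback. */
static void model_buf_mapped(void)
{
	if (map_count++ == 0)
		logging_enabled = 1;   /* first mapping: start logging */
}

/* Stand-in for the client's buf_unmapped() callback. */
static void model_buf_unmapped(void)
{
	assert(map_count > 0);
	if (--map_count == 0)
		logging_enabled = 0;   /* last mapping gone: stop logging */
}
```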
339 | |||
340 | |||
341 | Resources | ||
342 | ========= | ||
343 | |||
344 | For news, example code, mailing list, etc. see the relayfs homepage: | ||
345 | |||
346 | http://relayfs.sourceforge.net | ||
347 | |||
348 | |||
349 | Credits | ||
350 | ======= | ||
351 | |||
352 | The ideas and specs for relayfs came about as a result of discussions | ||
353 | on tracing involving the following: | ||
354 | |||
355 | Michel Dagenais <michel.dagenais@polymtl.ca> | ||
356 | Richard Moore <richardj_moore@uk.ibm.com> | ||
357 | Bob Wisniewski <bob@watson.ibm.com> | ||
358 | Karim Yaghmour <karim@opersys.com> | ||
359 | Tom Zanussi <zanussi@us.ibm.com> | ||
360 | |||
361 | Also thanks to Hubertus Franke for a lot of useful suggestions and bug | ||
362 | reports. | ||