author | Anton Altaparmakov <aia21@cantab.net> | 2006-01-19 11:39:33 -0500 |
---|---|---|
committer | Anton Altaparmakov <aia21@cantab.net> | 2006-01-19 11:39:33 -0500 |
commit | 944d79559d154c12becde0dab327016cf438f46c (patch) | |
tree | 50c101806f4d3b6585222dda060559eb4f3e005a /Documentation/filesystems | |
parent | d087e4bdd24ebe3ae3d0b265b6573ec901af4b4b (diff) | |
parent | 0f36b018b2e314d45af86449f1a97facb1fbe300 (diff) |
Merge branch 'master' of /usr/src/ntfs-2.6/
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/00-INDEX | 8 | ||||
-rw-r--r-- | Documentation/filesystems/configfs/configfs.txt | 434 | ||||
-rw-r--r-- | Documentation/filesystems/configfs/configfs_example.c | 474 | ||||
-rw-r--r-- | Documentation/filesystems/dlmfs.txt | 130 | ||||
-rw-r--r-- | Documentation/filesystems/ext3.txt | 181 | ||||
-rw-r--r-- | Documentation/filesystems/fuse.txt | 63 | ||||
-rw-r--r-- | Documentation/filesystems/ocfs2.txt | 55 | ||||
-rw-r--r-- | Documentation/filesystems/proc.txt | 19 | ||||
-rw-r--r-- | Documentation/filesystems/ramfs-rootfs-initramfs.txt | 72 | ||||
-rw-r--r-- | Documentation/filesystems/relayfs.txt | 126 | ||||
-rw-r--r-- | Documentation/filesystems/spufs.txt | 521 | ||||
-rw-r--r-- | Documentation/filesystems/sysfs-pci.txt | 21 | ||||
-rw-r--r-- | Documentation/filesystems/tmpfs.txt | 12 | ||||
-rw-r--r-- | Documentation/filesystems/vfs.txt | 5 |
14 files changed, 1998 insertions, 123 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index bcfbab899b37..74052d22d868 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX | |||
@@ -12,14 +12,16 @@ cifs.txt | |||
12 | - description of the CIFS filesystem | 12 | - description of the CIFS filesystem |
13 | coda.txt | 13 | coda.txt |
14 | - description of the CODA filesystem. | 14 | - description of the CODA filesystem. |
15 | configfs/ | ||
16 | - directory containing configfs documentation and example code. | ||
15 | cramfs.txt | 17 | cramfs.txt |
16 | - info on the cram filesystem for small storage (ROMs etc) | 18 | - info on the cram filesystem for small storage (ROMs etc) |
17 | devfs/ | 19 | devfs/ |
18 | - directory containing devfs documentation. | 20 | - directory containing devfs documentation. |
21 | dlmfs.txt | ||
22 | - info on the userspace interface to the OCFS2 DLM. | ||
19 | ext2.txt | 23 | ext2.txt |
20 | - info, mount options and specifications for the Ext2 filesystem. | 24 | - info, mount options and specifications for the Ext2 filesystem. |
21 | fat_cvf.txt | ||
22 | - info on the Compressed Volume Files extension to the FAT filesystem | ||
23 | hpfs.txt | 25 | hpfs.txt |
24 | - info and mount options for the OS/2 HPFS. | 26 | - info and mount options for the OS/2 HPFS. |
25 | isofs.txt | 27 | isofs.txt |
@@ -32,6 +34,8 @@ ntfs.txt | |||
32 | - info and mount options for the NTFS filesystem (Windows NT). | 34 | - info and mount options for the NTFS filesystem (Windows NT). |
33 | proc.txt | 35 | proc.txt |
34 | - info on Linux's /proc filesystem. | 36 | - info on Linux's /proc filesystem. |
37 | ocfs2.txt | ||
38 | - info and mount options for the OCFS2 clustered filesystem. | ||
35 | romfs.txt | 39 | romfs.txt |
36 | - Description of the ROMFS filesystem. | 40 | - Description of the ROMFS filesystem. |
37 | smbfs.txt | 41 | smbfs.txt |
diff --git a/Documentation/filesystems/configfs/configfs.txt b/Documentation/filesystems/configfs/configfs.txt new file mode 100644 index 000000000000..c4ff96b7c4e0 --- /dev/null +++ b/Documentation/filesystems/configfs/configfs.txt | |||
@@ -0,0 +1,434 @@ | |||
1 | |||
2 | configfs - Userspace-driven kernel object configuration. | ||
3 | |||
4 | Joel Becker <joel.becker@oracle.com> | ||
5 | |||
6 | Updated: 31 March 2005 | ||
7 | |||
8 | Copyright (c) 2005 Oracle Corporation, | ||
9 | Joel Becker <joel.becker@oracle.com> | ||
10 | |||
11 | |||
12 | [What is configfs?] | ||
13 | |||
14 | configfs is a ram-based filesystem that provides the converse of | ||
15 | sysfs's functionality. Where sysfs is a filesystem-based view of | ||
16 | kernel objects, configfs is a filesystem-based manager of kernel | ||
17 | objects, or config_items. | ||
18 | |||
19 | With sysfs, an object is created in kernel (for example, when a device | ||
20 | is discovered) and it is registered with sysfs. Its attributes then | ||
21 | appear in sysfs, allowing userspace to read the attributes via | ||
22 | readdir(3)/read(2). It may allow some attributes to be modified via | ||
23 | write(2). The important point is that the object is created and | ||
24 | destroyed in kernel, the kernel controls the lifecycle of the sysfs | ||
25 | representation, and sysfs is merely a window on all this. | ||
26 | |||
27 | A configfs config_item is created via an explicit userspace operation: | ||
28 | mkdir(2). It is destroyed via rmdir(2). The attributes appear at | ||
29 | mkdir(2) time, and can be read or modified via read(2) and write(2). | ||
30 | As with sysfs, readdir(3) queries the list of items and/or attributes. | ||
31 | symlink(2) can be used to group items together. Unlike sysfs, the | ||
32 | lifetime of the representation is completely driven by userspace. The | ||
33 | kernel modules backing the items must respond to this. | ||
34 | |||
35 | Both sysfs and configfs can and should exist together on the same | ||
36 | system. One is not a replacement for the other. | ||
37 | |||
38 | [Using configfs] | ||
39 | |||
40 | configfs can be compiled as a module or into the kernel. You can access | ||
41 | it by doing | ||
42 | |||
43 | mount -t configfs none /config | ||
44 | |||
45 | The configfs tree will be empty unless client modules are also loaded. | ||
46 | These are modules that register their item types with configfs as | ||
47 | subsystems. Once a client subsystem is loaded, it will appear as a | ||
48 | subdirectory (or more than one) under /config. Like sysfs, the | ||
49 | configfs tree is always there, whether mounted on /config or not. | ||
50 | |||
51 | An item is created via mkdir(2). The item's attributes will also | ||
52 | appear at this time. readdir(3) can determine what the attributes are, | ||
53 | read(2) can query their default values, and write(2) can store new | ||
54 | values. Like sysfs, attributes should be ASCII text files, preferably | ||
55 | with only one value per file. The same efficiency caveats from sysfs | ||
56 | apply. Don't mix more than one attribute in one attribute file. | ||
57 | |||
58 | Like sysfs, configfs expects write(2) to store the entire buffer at | ||
59 | once. When writing to configfs attributes, userspace processes should | ||
60 | first read the entire file, modify the portions they wish to change, and | ||
61 | then write the entire buffer back. Attribute files have a maximum size | ||
62 | of one page (PAGE_SIZE, 4096 on i386). | ||
63 | |||
64 | When an item needs to be destroyed, remove it with rmdir(2). An | ||
65 | item cannot be destroyed if any other item has a link to it (via | ||
66 | symlink(2)). Links can be removed via unlink(2). | ||
67 | |||
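The rules in this section (create an item with mkdir(2), store each value with a single write(2) of at most one page, destroy the item with rmdir(2)) can be sketched from userspace roughly as follows. This is a minimal illustration, not part of the patch: the helper names item_create(), item_destroy() and attr_store() are hypothetical, and ATTR_MAX assumes the 4096-byte i386 page size mentioned above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: 4096 matches PAGE_SIZE on i386; other architectures differ. */
#define ATTR_MAX 4096

/* Hypothetical helper: create an item directory, e.g. /config/fakenbd/disk1. */
static int item_create(const char *item_dir)
{
	assert(item_dir);
	return mkdir(item_dir, 0755) < 0 ? -errno : 0;
}

/* Hypothetical helper: destroy an item with rmdir(2). */
static int item_destroy(const char *item_dir)
{
	assert(item_dir);
	return rmdir(item_dir) < 0 ? -errno : 0;
}

/*
 * Hypothetical helper: store a complete value into one attribute file
 * using exactly one write(2).  Returns 0 on success, -errno on failure.
 */
static int attr_store(const char *item_dir, const char *attr,
		      const char *value)
{
	char path[512];
	size_t len;
	ssize_t written;
	int fd, n, err = 0;

	assert(item_dir && attr && value);
	len = strlen(value);
	if (len > ATTR_MAX)
		return -EFBIG;

	n = snprintf(path, sizeof(path), "%s/%s", item_dir, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	/* No O_CREAT: attribute files are expected to exist after mkdir(2). */
	fd = open(path, O_WRONLY | O_TRUNC);
	if (fd < 0)
		return -errno;

	written = write(fd, value, len);
	if (written < 0)
		err = -errno;
	else if ((size_t)written != len)
		err = -EIO;	/* a short write would split the value */

	if (close(fd) < 0 && !err)
		err = -errno;
	return err;
}
```

With the FakeNBD layout used in the next section, a caller might run item_create("/config/fakenbd/disk1") followed by attr_store("/config/fakenbd/disk1", "target", "10.0.0.1\n"). Treating a short write as an error is a design choice of this sketch, made because a value split across several writes would not be stored as one buffer.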
68 | [Configuring FakeNBD: an Example] | ||
69 | |||
70 | Imagine there's a Network Block Device (NBD) driver that allows you to | ||
71 | access remote block devices. Call it FakeNBD. FakeNBD uses configfs | ||
72 | for its configuration. Obviously, there will be a nice program that | ||
73 | sysadmins use to configure FakeNBD, but somehow that program has to tell | ||
74 | the driver about it. Here's where configfs comes in. | ||
75 | |||
76 | When the FakeNBD driver is loaded, it registers itself with configfs. | ||
77 | readdir(3) sees this just fine: | ||
78 | |||
79 | # ls /config | ||
80 | fakenbd | ||
81 | |||
82 | A fakenbd connection can be created with mkdir(2). The name is | ||
83 | arbitrary, but likely the tool will make some use of the name. Perhaps | ||
84 | it is a uuid or a disk name: | ||
85 | |||
86 | # mkdir /config/fakenbd/disk1 | ||
87 | # ls /config/fakenbd/disk1 | ||
88 | target device rw | ||
89 | |||
90 | The target attribute contains the IP address of the server FakeNBD will | ||
91 | connect to. The device attribute is the device on the server. | ||
92 | Predictably, the rw attribute determines whether the connection is | ||
93 | read-only or read-write. | ||
94 | |||
95 | # echo 10.0.0.1 > /config/fakenbd/disk1/target | ||
96 | # echo /dev/sda1 > /config/fakenbd/disk1/device | ||
97 | # echo 1 > /config/fakenbd/disk1/rw | ||
98 | |||
99 | That's it. That's all there is. Now the device is configured, via the | ||
100 | shell no less. | ||
101 | |||
102 | [Coding With configfs] | ||
103 | |||
104 | Every object in configfs is a config_item. A config_item reflects an | ||
105 | object in the subsystem. It has attributes that match values on that | ||
106 | object. configfs handles the filesystem representation of that object | ||
107 | and its attributes, allowing the subsystem to ignore all but the | ||
108 | basic show/store interaction. | ||
109 | |||
110 | Items are created and destroyed inside a config_group. A group is a | ||
111 | collection of items that share the same attributes and operations. | ||
112 | Items are created by mkdir(2) and removed by rmdir(2), but configfs | ||
113 | handles that. The group has a set of operations to perform these tasks. | ||
114 | |||
115 | A subsystem is the top level of a client module. During initialization, | ||
116 | the client module registers the subsystem with configfs, and the subsystem | ||
117 | appears as a directory at the top of the configfs filesystem. A | ||
118 | subsystem is also a config_group, and can do everything a config_group | ||
119 | can. | ||
120 | |||
121 | [struct config_item] | ||
122 | |||
123 | struct config_item { | ||
124 | char *ci_name; | ||
125 | char ci_namebuf[UOBJ_NAME_LEN]; | ||
126 | struct kref ci_kref; | ||
127 | struct list_head ci_entry; | ||
128 | struct config_item *ci_parent; | ||
129 | struct config_group *ci_group; | ||
130 | struct config_item_type *ci_type; | ||
131 | struct dentry *ci_dentry; | ||
132 | }; | ||
133 | |||
134 | void config_item_init(struct config_item *); | ||
135 | void config_item_init_type_name(struct config_item *, | ||
136 | const char *name, | ||
137 | struct config_item_type *type); | ||
138 | struct config_item *config_item_get(struct config_item *); | ||
139 | void config_item_put(struct config_item *); | ||
140 | |||
141 | Generally, struct config_item is embedded in a container structure, a | ||
142 | structure that actually represents what the subsystem is doing. The | ||
143 | config_item portion of that structure is how the object interacts with | ||
144 | configfs. | ||
145 | |||
146 | Whether statically defined in a source file or created by a parent | ||
147 | config_group, a config_item must have one of the _init() functions | ||
148 | called on it. This initializes the reference count and sets up the | ||
149 | appropriate fields. | ||
150 | |||
151 | All users of a config_item should have a reference on it via | ||
152 | config_item_get(), and drop the reference when they are done via | ||
153 | config_item_put(). | ||
154 | |||
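The embedding and reference-counting pattern described above can be modelled in plain userspace C. The sketch below is only an analogy under stated assumptions: struct model_item, struct fake_disk and all model_* and fake_disk_* names are hypothetical stand-ins, the release hook is stored on the item itself rather than on a config_item_type, and the counter is a plain int rather than a struct kref, so it is not safe for concurrent use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified model of a config_item: a refcount plus a release hook. */
struct model_item {
	int refcount;
	void (*release)(struct model_item *);
};

/* Like the _init() functions: the item starts with one reference. */
static void model_item_init(struct model_item *item,
			    void (*release)(struct model_item *))
{
	item->refcount = 1;
	item->release = release;
}

/* Take a reference before using the item. */
static struct model_item *model_item_get(struct model_item *item)
{
	if (item)
		item->refcount++;
	return item;
}

/* Drop a reference; the last put triggers release(). */
static void model_item_put(struct model_item *item)
{
	if (!item)
		return;
	assert(item->refcount > 0);
	if (--item->refcount == 0 && item->release)
		item->release(item);
}

/* Hypothetical container: what the subsystem actually cares about. */
struct fake_disk {
	struct model_item item;
	int rw;
};

static int fake_disks_released;	/* counts release() calls, for illustration */

/* Recover the container from the embedded item, then free it. */
static void fake_disk_release(struct model_item *item)
{
	struct fake_disk *disk = container_of(item, struct fake_disk, item);

	fake_disks_released++;
	free(disk);
}

static struct fake_disk *fake_disk_new(int rw)
{
	struct fake_disk *disk = calloc(1, sizeof(*disk));

	if (!disk)
		return NULL;
	model_item_init(&disk->item, fake_disk_release);
	disk->rw = rw;
	return disk;
}
```

The point of the model is the ownership rule: whoever calls model_item_get() must later call model_item_put(), and the container may only be freed from the release hook.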
155 | By itself, a config_item cannot do much more than appear in configfs. | ||
156 | Usually a subsystem wants the item to display and/or store attributes, | ||
157 | among other things. For that, it needs a type. | ||
158 | |||
159 | [struct config_item_type] | ||
160 | |||
161 | struct configfs_item_operations { | ||
162 | void (*release)(struct config_item *); | ||
163 | ssize_t (*show_attribute)(struct config_item *, | ||
164 | struct configfs_attribute *, | ||
165 | char *); | ||
166 | ssize_t (*store_attribute)(struct config_item *, | ||
167 | struct configfs_attribute *, | ||
168 | const char *, size_t); | ||
169 | int (*allow_link)(struct config_item *src, | ||
170 | struct config_item *target); | ||
171 | int (*drop_link)(struct config_item *src, | ||
172 | struct config_item *target); | ||
173 | }; | ||
174 | |||
175 | struct config_item_type { | ||
176 | struct module *ct_owner; | ||
177 | struct configfs_item_operations *ct_item_ops; | ||
178 | struct configfs_group_operations *ct_group_ops; | ||
179 | struct configfs_attribute **ct_attrs; | ||
180 | }; | ||
181 | |||
182 | The most basic function of a config_item_type is to define what | ||
183 | operations can be performed on a config_item. All items that have been | ||
184 | allocated dynamically will need to provide the ct_item_ops->release() | ||
185 | method. This method is called when the config_item's reference count | ||
186 | reaches zero. Items that wish to display an attribute need to provide | ||
187 | the ct_item_ops->show_attribute() method. Similarly, storing a new | ||
188 | attribute value uses the store_attribute() method. | ||
189 | |||
190 | [struct configfs_attribute] | ||
191 | |||
192 | struct configfs_attribute { | ||
193 | char *ca_name; | ||
194 | struct module *ca_owner; | ||
195 | mode_t ca_mode; | ||
196 | }; | ||
197 | |||
198 | When a config_item wants an attribute to appear as a file in the item's | ||
199 | configfs directory, it must define a configfs_attribute describing it. | ||
200 | It then adds the attribute to the NULL-terminated array | ||
201 | config_item_type->ct_attrs. When the item appears in configfs, the | ||
202 | attribute file will appear with the configfs_attribute->ca_name | ||
203 | filename. configfs_attribute->ca_mode specifies the file permissions. | ||
204 | |||
205 | If an attribute is readable and the config_item provides a | ||
206 | ct_item_ops->show_attribute() method, that method will be called | ||
207 | whenever userspace asks for a read(2) on the attribute. The converse | ||
208 | will happen for write(2). | ||
209 | |||
210 | [struct config_group] | ||
211 | |||
212 | A config_item cannot live in a vacuum. The only way one can be created | ||
213 | is via mkdir(2) on a config_group. This will trigger creation of a | ||
214 | child item. | ||
215 | |||
216 | struct config_group { | ||
217 | struct config_item cg_item; | ||
218 | struct list_head cg_children; | ||
219 | struct configfs_subsystem *cg_subsys; | ||
220 | struct config_group **default_groups; | ||
221 | }; | ||
222 | |||
223 | void config_group_init(struct config_group *group); | ||
224 | void config_group_init_type_name(struct config_group *group, | ||
225 | const char *name, | ||
226 | struct config_item_type *type); | ||
227 | |||
228 | |||
229 | The config_group structure contains a config_item. Properly configuring | ||
230 | that item means that a group can behave as an item in its own right. | ||
231 | However, it can do more: it can create child items or groups. This is | ||
232 | accomplished via the group operations specified on the group's | ||
233 | config_item_type. | ||
234 | |||
235 | struct configfs_group_operations { | ||
236 | struct config_item *(*make_item)(struct config_group *group, | ||
237 | const char *name); | ||
238 | struct config_group *(*make_group)(struct config_group *group, | ||
239 | const char *name); | ||
240 | int (*commit_item)(struct config_item *item); | ||
241 | void (*drop_item)(struct config_group *group, | ||
242 | struct config_item *item); | ||
243 | }; | ||
244 | |||
245 | A group creates child items by providing the | ||
246 | ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new | ||
247 | config_item (or more likely, its container structure), initializes it, | ||
248 | and returns it to configfs. Configfs will then populate the filesystem | ||
249 | tree to reflect the new item. | ||
250 | |||
251 | If the subsystem wants the child to be a group itself, the subsystem | ||
252 | provides ct_group_ops->make_group(). Everything else behaves the same, | ||
253 | using the group _init() functions on the group. | ||
254 | |||
255 | Finally, when userspace calls rmdir(2) on the item or group, | ||
256 | ct_group_ops->drop_item() is called. As a config_group is also a | ||
257 | config_item, it is not necessary for a separate drop_group() method. | ||
258 | The subsystem must config_item_put() the reference that was initialized | ||
259 | upon item allocation. If a subsystem has no work to do, it may omit | ||
260 | the ct_group_ops->drop_item() method, and configfs will call | ||
261 | config_item_put() on the item on behalf of the subsystem. | ||
262 | |||
263 | IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2) | ||
264 | is called, configfs WILL remove the item from the filesystem tree | ||
265 | (assuming that it has no children to keep it busy). The subsystem is | ||
266 | responsible for responding to this. If the subsystem has references to | ||
267 | the item in other threads, the memory is safe. It may take some time | ||
268 | for the item to actually disappear from the subsystem's usage. But it | ||
269 | is gone from configfs. | ||
270 | |||
271 | A config_group cannot be removed while it still has child items. This | ||
272 | is implemented in the configfs rmdir(2) code. ->drop_item() will not be | ||
273 | called, as the item has not been dropped. rmdir(2) will fail, as the | ||
274 | directory is not empty. | ||
275 | |||
276 | [struct configfs_subsystem] | ||
277 | |||
278 | A subsystem must register itself, usually at module_init time. This | ||
279 | tells configfs to make the subsystem appear in the file tree. | ||
280 | |||
281 | struct configfs_subsystem { | ||
282 | struct config_group su_group; | ||
283 | struct semaphore su_sem; | ||
284 | }; | ||
285 | |||
286 | int configfs_register_subsystem(struct configfs_subsystem *subsys); | ||
287 | void configfs_unregister_subsystem(struct configfs_subsystem *subsys); | ||
288 | |||
289 | A subsystem consists of a toplevel config_group and a semaphore. | ||
290 | The group is where child config_items are created. For a subsystem, | ||
291 | this group is usually defined statically. Before calling | ||
292 | configfs_register_subsystem(), the subsystem must have initialized the | ||
293 | group via the usual group _init() functions, and it must also have | ||
294 | initialized the semaphore. | ||
295 | When the register call returns, the subsystem is live, and it | ||
296 | will be visible via configfs. At that point, mkdir(2) can be called and | ||
297 | the subsystem must be ready for it. | ||
298 | |||
299 | [An Example] | ||
300 | |||
301 | The best example of these basic concepts is the simple_children | ||
302 | subsystem/group and the simple_child item in configfs_example.c. It | ||
303 | shows a trivial object displaying and storing an attribute, and a simple | ||
304 | group creating and destroying these children. | ||
305 | |||
306 | [Hierarchy Navigation and the Subsystem Semaphore] | ||
307 | |||
308 | There is an extra bonus that configfs provides. The config_groups and | ||
309 | config_items are arranged in a hierarchy due to the fact that they | ||
310 | appear in a filesystem. A subsystem is NEVER to touch the filesystem | ||
311 | parts, but the subsystem might be interested in this hierarchy. For | ||
312 | this reason, the hierarchy is mirrored via the config_group->cg_children | ||
313 | and config_item->ci_parent structure members. | ||
314 | |||
315 | A subsystem can navigate the cg_children list and the ci_parent pointer | ||
316 | to see the tree created by the subsystem. This can race with configfs' | ||
317 | management of the hierarchy, so configfs uses the subsystem semaphore to | ||
318 | protect modifications. Whenever a subsystem wants to navigate the | ||
319 | hierarchy, it must do so under the protection of the subsystem | ||
320 | semaphore. | ||
321 | |||
322 | A subsystem will be prevented from acquiring the semaphore while a newly | ||
323 | allocated item has not been linked into this hierarchy. Similarly, it | ||
324 | will not be able to acquire the semaphore while a dropping item has not | ||
325 | yet been unlinked. This means that an item's ci_parent pointer will | ||
326 | never be NULL while the item is in configfs, and that an item will only | ||
327 | be in its parent's cg_children list for the same duration. This allows | ||
328 | a subsystem to trust ci_parent and cg_children while they hold the | ||
329 | semaphore. | ||
330 | |||
331 | [Item Aggregation Via symlink(2)] | ||
332 | |||
333 | configfs provides a simple group via the group->item parent/child | ||
334 | relationship. Often, however, a larger environment requires aggregation | ||
335 | outside of the parent/child connection. This is implemented via | ||
336 | symlink(2). | ||
337 | |||
338 | A config_item may provide the ct_item_ops->allow_link() and | ||
339 | ct_item_ops->drop_link() methods. If the ->allow_link() method exists, | ||
340 | symlink(2) may be called with the config_item as the source of the link. | ||
341 | These links are only allowed between configfs config_items. Any | ||
342 | symlink(2) attempt outside the configfs filesystem will be denied. | ||
343 | |||
344 | When symlink(2) is called, the source config_item's ->allow_link() | ||
345 | method is called with itself and a target item. If the source item | ||
346 | allows linking to target item, it returns 0. A source item may wish to | ||
347 | reject a link if it only wants links to a certain type of object (say, | ||
348 | in its own subsystem). | ||
349 | |||
350 | When unlink(2) is called on the symbolic link, the source item is | ||
351 | notified via the ->drop_link() method. Like the ->drop_item() method, | ||
352 | this is a void function and cannot return failure. The subsystem is | ||
353 | responsible for responding to the change. | ||
354 | |||
355 | A config_item cannot be removed while it links to any other item, nor | ||
356 | can it be removed while an item links to it. Dangling symlinks are not | ||
357 | allowed in configfs. | ||
358 | |||
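The link rules above can be illustrated with a small userspace model. It is a sketch under assumptions, not configfs code: struct link_item, link_create(), link_remove() and link_rmdir_check() are hypothetical names, link_create() plays the role configfs plays around the ->allow_link() callback, and the -EPERM and -EBUSY return values are illustrative choices that the text above does not specify.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct link_type {
	const char *name;
};

struct link_item {
	const struct link_type *type;
	/* Optional hook, modelled on ct_item_ops->allow_link(). */
	int (*allow_link)(struct link_item *src, struct link_item *target);
	int nlinks;	/* links from this item plus links to it */
};

/* Example policy: only accept targets of the source's own type. */
static int same_type_only(struct link_item *src, struct link_item *target)
{
	return src->type == target->type ? 0 : -EPERM;
}

/* Model of symlink(2): ask the source item, then record the link. */
static int link_create(struct link_item *src, struct link_item *target)
{
	int ret;

	assert(src && target);
	if (!src->allow_link)
		return -EPERM;	/* no hook: the item cannot be a link source */
	ret = src->allow_link(src, target);
	if (ret)
		return ret;
	src->nlinks++;
	target->nlinks++;
	return 0;
}

/* Model of unlink(2) on the symlink: cannot fail, like ->drop_link(). */
static void link_remove(struct link_item *src, struct link_item *target)
{
	assert(src->nlinks > 0 && target->nlinks > 0);
	src->nlinks--;
	target->nlinks--;
}

/* Model of the rmdir(2) check: linked items in either direction stay. */
static int link_rmdir_check(const struct link_item *item)
{
	return item->nlinks ? -EBUSY : 0;
}
```

Counting links on both ends is what makes dangling symlinks impossible in this model: neither the source nor the target passes link_rmdir_check() until link_remove() has run.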
359 | [Automatically Created Subgroups] | ||
360 | |||
361 | A new config_group may want to have two types of child config_items. | ||
362 | While this could be codified by magic names in ->make_item(), it is much | ||
363 | more explicit to have a method whereby userspace sees this divergence. | ||
364 | |||
365 | Rather than have a group where some items behave differently than | ||
366 | others, configfs provides a method whereby one or many subgroups are | ||
367 | automatically created inside the parent at its creation. Thus, | ||
368 | mkdir("parent") results in "parent", "parent/subgroup1", up through | ||
369 | "parent/subgroupN". Items of type 1 can now be created in | ||
370 | "parent/subgroup1", and items of type N can be created in | ||
371 | "parent/subgroupN". | ||
372 | |||
373 | These automatic subgroups, or default groups, do not preclude other | ||
374 | children of the parent group. If ct_group_ops->make_group() exists, | ||
375 | other child groups can be created on the parent group directly. | ||
376 | |||
377 | A configfs subsystem specifies default groups by filling in the | ||
378 | NULL-terminated array default_groups on the config_group structure. | ||
379 | Each group in that array is populated in the configfs tree at the same | ||
380 | time as the parent group. Similarly, they are removed at the same time | ||
381 | as the parent. No extra notification is provided. When a ->drop_item() | ||
382 | method call notifies the subsystem the parent group is going away, it | ||
383 | also means every default group child associated with that parent group. | ||
384 | |||
385 | As a consequence of this, default_groups cannot be removed directly via | ||
386 | rmdir(2). They also are not considered when rmdir(2) on the parent | ||
387 | group is checking for children. | ||
388 | |||
389 | [Committable Items] | ||
390 | |||
391 | NOTE: Committable items are currently unimplemented. | ||
392 | |||
393 | Some config_items cannot have a valid initial state. That is, no | ||
394 | default values can be specified for the item's attributes such that the | ||
395 | item can do its work. Userspace must configure one or more attributes, | ||
396 | after which the subsystem can start whatever entity this item | ||
397 | represents. | ||
398 | |||
399 | Consider the FakeNBD device from above. Without a target address *and* | ||
400 | a target device, the subsystem has no idea what block device to import. | ||
401 | The simple example assumes that the subsystem merely waits until all the | ||
402 | appropriate attributes are configured, and then connects. This will, | ||
403 | indeed, work, but now every attribute store must check if the attributes | ||
404 | are initialized. Every attribute store must fire off the connection if | ||
405 | that condition is met. | ||
406 | |||
407 | Far better would be an explicit action notifying the subsystem that the | ||
408 | config_item is ready to go. More importantly, an explicit action allows | ||
409 | the subsystem to provide feedback as to whether the attributes are | ||
410 | initialized in a way that makes sense. configfs provides this as | ||
411 | committable items. | ||
412 | |||
413 | configfs still uses only normal filesystem operations. An item is | ||
414 | committed via rename(2). The item is moved from a directory where it | ||
415 | can be modified to a directory where it cannot. | ||
416 | |||
417 | Any group that provides the ct_group_ops->commit_item() method has | ||
418 | committable items. When this group appears in configfs, mkdir(2) will | ||
419 | not work directly in the group. Instead, the group will have two | ||
420 | subdirectories: "live" and "pending". The "live" directory does not | ||
421 | support mkdir(2) or rmdir(2) either. It only allows rename(2). The | ||
422 | "pending" directory does allow mkdir(2) and rmdir(2). An item is | ||
423 | created in the "pending" directory. Its attributes can be modified at | ||
424 | will. Userspace commits the item by renaming it into the "live" | ||
425 | directory. At this point, the subsystem receives the ->commit_item() | ||
426 | callback. If all required attributes are filled to satisfaction, the | ||
427 | method returns zero and the item is moved to the "live" directory. | ||
428 | |||
429 | As rmdir(2) does not work in the "live" directory, an item must be | ||
430 | shutdown, or "uncommitted". Again, this is done via rename(2), this | ||
431 | time from the "live" directory back to the "pending" one. The subsystem | ||
432 | is notified by the ct_group_ops->uncommit_object() method. | ||
433 | |||
434 | |||
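Since the text above notes that committable items are currently unimplemented, the following is only a sketch of how a userspace tool might drive the described "pending"/"live" protocol with ordinary filesystem calls. The helper names pending_create(), pending_remove(), item_commit() and item_uncommit() are hypothetical, as is the fixed 512-byte path buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "<group>/<state>/<name>" into buf; returns 0 or -ENAMETOOLONG. */
static int item_path(char *buf, size_t size, const char *group,
		     const char *state, const char *name)
{
	int n;

	assert(buf && group && state && name);
	n = snprintf(buf, size, "%s/%s/%s", group, state, name);
	return (n < 0 || (size_t)n >= size) ? -ENAMETOOLONG : 0;
}

/* Items are created and destroyed only under "pending". */
static int pending_create(const char *group, const char *name)
{
	char path[512];
	int err = item_path(path, sizeof(path), group, "pending", name);

	if (err)
		return err;
	return mkdir(path, 0755) < 0 ? -errno : 0;
}

static int pending_remove(const char *group, const char *name)
{
	char path[512];
	int err = item_path(path, sizeof(path), group, "pending", name);

	if (err)
		return err;
	return rmdir(path) < 0 ? -errno : 0;
}

/* Move an item between the two state directories with rename(2). */
static int item_move(const char *group, const char *from, const char *to,
		     const char *name)
{
	char src[512], dst[512];
	int err;

	err = item_path(src, sizeof(src), group, from, name);
	if (!err)
		err = item_path(dst, sizeof(dst), group, to, name);
	if (err)
		return err;
	return rename(src, dst) < 0 ? -errno : 0;
}

/* Commit: "pending" -> "live".  A failure means the item was rejected. */
static int item_commit(const char *group, const char *name)
{
	return item_move(group, "pending", "live", name);
}

/* Uncommit: "live" -> "pending", after which rmdir(2) works again. */
static int item_uncommit(const char *group, const char *name)
{
	return item_move(group, "live", "pending", name);
}
```

A tool would typically call pending_create(), write the attributes, and then call item_commit(); a nonzero return from item_commit() is where the subsystem's feedback about incomplete attributes would surface.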
diff --git a/Documentation/filesystems/configfs/configfs_example.c b/Documentation/filesystems/configfs/configfs_example.c new file mode 100644 index 000000000000..f3c6e4946f98 --- /dev/null +++ b/Documentation/filesystems/configfs/configfs_example.c | |||
@@ -0,0 +1,474 @@ | |||
1 | /* | ||
2 | * vim: noexpandtab ts=8 sts=0 sw=8: | ||
3 | * | ||
4 | * configfs_example.c - This file is a demonstration module containing | ||
5 | * a number of configfs subsystems. | ||
6 | * | ||
7 | * This program is free software; you can redistribute it and/or | ||
8 | * modify it under the terms of the GNU General Public | ||
9 | * License as published by the Free Software Foundation; either | ||
10 | * version 2 of the License, or (at your option) any later version. | ||
11 | * | ||
12 | * This program is distributed in the hope that it will be useful, | ||
13 | * but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
14 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU | ||
15 | * General Public License for more details. | ||
16 | * | ||
17 | * You should have received a copy of the GNU General Public | ||
18 | * License along with this program; if not, write to the | ||
19 | * Free Software Foundation, Inc., 59 Temple Place - Suite 330, | ||
20 | * Boston, MA 02111-1307, USA. | ||
21 | * | ||
22 | * Based on sysfs: | ||
23 | * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel | ||
24 | * | ||
25 | * configfs Copyright (C) 2005 Oracle. All rights reserved. | ||
26 | */ | ||
27 | |||
28 | #include <linux/init.h> | ||
29 | #include <linux/module.h> | ||
30 | #include <linux/slab.h> | ||
31 | |||
32 | #include <linux/configfs.h> | ||
33 | |||
34 | |||
35 | |||
36 | /* | ||
37 | * 01-childless | ||
38 | * | ||
39 | * This first example is a childless subsystem. It cannot create | ||
40 | * any config_items. It just has attributes. | ||
41 | * | ||
42 | * Note that we are enclosing the configfs_subsystem inside a container. | ||
43 | * This is not necessary if a subsystem has no attributes directly | ||
44 | * on the subsystem. See the next example, 02-simple-children, for | ||
45 | * such a subsystem. | ||
46 | */ | ||
47 | |||
48 | struct childless { | ||
49 | struct configfs_subsystem subsys; | ||
50 | int showme; | ||
51 | int storeme; | ||
52 | }; | ||
53 | |||
54 | struct childless_attribute { | ||
55 | struct configfs_attribute attr; | ||
56 | ssize_t (*show)(struct childless *, char *); | ||
57 | ssize_t (*store)(struct childless *, const char *, size_t); | ||
58 | }; | ||
59 | |||
60 | static inline struct childless *to_childless(struct config_item *item) | ||
61 | { | ||
62 | return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL; | ||
63 | } | ||
64 | |||
65 | static ssize_t childless_showme_read(struct childless *childless, | ||
66 | char *page) | ||
67 | { | ||
68 | ssize_t pos; | ||
69 | |||
70 | pos = sprintf(page, "%d\n", childless->showme); | ||
71 | childless->showme++; | ||
72 | |||
73 | return pos; | ||
74 | } | ||
75 | |||
76 | static ssize_t childless_storeme_read(struct childless *childless, | ||
77 | char *page) | ||
78 | { | ||
79 | return sprintf(page, "%d\n", childless->storeme); | ||
80 | } | ||
81 | |||
82 | static ssize_t childless_storeme_write(struct childless *childless, | ||
83 | const char *page, | ||
84 | size_t count) | ||
85 | { | ||
86 | unsigned long tmp; | ||
87 | char *p = (char *) page; | ||
88 | |||
89 | tmp = simple_strtoul(p, &p, 10); | ||
90 | if (!p || (*p && (*p != '\n'))) | ||
91 | return -EINVAL; | ||
92 | |||
93 | if (tmp > INT_MAX) | ||
94 | return -ERANGE; | ||
95 | |||
96 | childless->storeme = tmp; | ||
97 | |||
98 | return count; | ||
99 | } | ||
100 | |||
101 | static ssize_t childless_description_read(struct childless *childless, | ||
102 | char *page) | ||
103 | { | ||
104 | return sprintf(page, | ||
105 | "[01-childless]\n" | ||
106 | "\n" | ||
107 | "The childless subsystem is the simplest possible subsystem in\n" | ||
108 | "configfs. It does not support the creation of child config_items.\n" | ||
109 | "It only has a few attributes. In fact, it isn't much different\n" | ||
110 | "than a directory in /proc.\n"); | ||
111 | } | ||
112 | |||
113 | static struct childless_attribute childless_attr_showme = { | ||
114 | .attr = { .ca_owner = THIS_MODULE, .ca_name = "showme", .ca_mode = S_IRUGO }, | ||
115 | .show = childless_showme_read, | ||
116 | }; | ||
117 | static struct childless_attribute childless_attr_storeme = { | ||
118 | .attr = { .ca_owner = THIS_MODULE, .ca_name = "storeme", .ca_mode = S_IRUGO | S_IWUSR }, | ||
119 | .show = childless_storeme_read, | ||
120 | .store = childless_storeme_write, | ||
121 | }; | ||
122 | static struct childless_attribute childless_attr_description = { | ||
123 | .attr = { .ca_owner = THIS_MODULE, .ca_name = "description", .ca_mode = S_IRUGO }, | ||
124 | .show = childless_description_read, | ||
125 | }; | ||
126 | |||
127 | static struct configfs_attribute *childless_attrs[] = { | ||
128 | &childless_attr_showme.attr, | ||
129 | &childless_attr_storeme.attr, | ||
130 | &childless_attr_description.attr, | ||
131 | NULL, | ||
132 | }; | ||
133 | |||
134 | static ssize_t childless_attr_show(struct config_item *item, | ||
135 | struct configfs_attribute *attr, | ||
136 | char *page) | ||
137 | { | ||
138 | struct childless *childless = to_childless(item); | ||
139 | struct childless_attribute *childless_attr = | ||
140 | container_of(attr, struct childless_attribute, attr); | ||
141 | ssize_t ret = 0; | ||
142 | |||
143 | if (childless_attr->show) | ||
144 | ret = childless_attr->show(childless, page); | ||
145 | return ret; | ||
146 | } | ||
147 | |||
148 | static ssize_t childless_attr_store(struct config_item *item, | ||
149 | struct configfs_attribute *attr, | ||
150 | const char *page, size_t count) | ||
151 | { | ||
152 | struct childless *childless = to_childless(item); | ||
153 | struct childless_attribute *childless_attr = | ||
154 | container_of(attr, struct childless_attribute, attr); | ||
155 | ssize_t ret = -EINVAL; | ||
156 | |||
157 | if (childless_attr->store) | ||
158 | ret = childless_attr->store(childless, page, count); | ||
159 | return ret; | ||
160 | } | ||
161 | |||
162 | static struct configfs_item_operations childless_item_ops = { | ||
163 | .show_attribute = childless_attr_show, | ||
164 | .store_attribute = childless_attr_store, | ||
165 | }; | ||
166 | |||
167 | static struct config_item_type childless_type = { | ||
168 | .ct_item_ops = &childless_item_ops, | ||
169 | .ct_attrs = childless_attrs, | ||
170 | .ct_owner = THIS_MODULE, | ||
171 | }; | ||
172 | |||
173 | static struct childless childless_subsys = { | ||
174 | .subsys = { | ||
175 | .su_group = { | ||
176 | .cg_item = { | ||
177 | .ci_namebuf = "01-childless", | ||
178 | .ci_type = &childless_type, | ||
179 | }, | ||
180 | }, | ||
181 | }, | ||
182 | }; | ||
183 | |||
184 | |||
185 | /* ----------------------------------------------------------------- */ | ||
186 | |||
187 | /* | ||
188 | * 02-simple-children | ||
189 | * | ||
190 | * This example merely has a simple one-attribute child. Note that | ||
191 | * there is no extra attribute structure, as the child's attribute is | ||
192 | * known from the get-go. Also, there is no container for the | ||
193 | * subsystem, as it has no attributes of its own. | ||
194 | */ | ||
195 | |||
196 | struct simple_child { | ||
197 | struct config_item item; | ||
198 | int storeme; | ||
199 | }; | ||
200 | |||
201 | static inline struct simple_child *to_simple_child(struct config_item *item) | ||
202 | { | ||
203 | return item ? container_of(item, struct simple_child, item) : NULL; | ||
204 | } | ||
205 | |||
206 | static struct configfs_attribute simple_child_attr_storeme = { | ||
207 | .ca_owner = THIS_MODULE, | ||
208 | .ca_name = "storeme", | ||
209 | .ca_mode = S_IRUGO | S_IWUSR, | ||
210 | }; | ||
211 | |||
212 | static struct configfs_attribute *simple_child_attrs[] = { | ||
213 | &simple_child_attr_storeme, | ||
214 | NULL, | ||
215 | }; | ||
216 | |||
217 | static ssize_t simple_child_attr_show(struct config_item *item, | ||
218 | struct configfs_attribute *attr, | ||
219 | char *page) | ||
220 | { | ||
221 | ssize_t count; | ||
222 | struct simple_child *simple_child = to_simple_child(item); | ||
223 | |||
224 | count = sprintf(page, "%d\n", simple_child->storeme); | ||
225 | |||
226 | return count; | ||
227 | } | ||
228 | |||
229 | static ssize_t simple_child_attr_store(struct config_item *item, | ||
230 | struct configfs_attribute *attr, | ||
231 | const char *page, size_t count) | ||
232 | { | ||
233 | struct simple_child *simple_child = to_simple_child(item); | ||
234 | unsigned long tmp; | ||
235 | char *p = (char *) page; | ||
236 | |||
237 | tmp = simple_strtoul(p, &p, 10); | ||
238 | if (!p || (*p && (*p != '\n'))) | ||
239 | return -EINVAL; | ||
240 | |||
241 | if (tmp > INT_MAX) | ||
242 | return -ERANGE; | ||
243 | |||
244 | simple_child->storeme = tmp; | ||
245 | |||
246 | return count; | ||
247 | } | ||
248 | |||
249 | static void simple_child_release(struct config_item *item) | ||
250 | { | ||
251 | kfree(to_simple_child(item)); | ||
252 | } | ||
253 | |||
254 | static struct configfs_item_operations simple_child_item_ops = { | ||
255 | .release = simple_child_release, | ||
256 | .show_attribute = simple_child_attr_show, | ||
257 | .store_attribute = simple_child_attr_store, | ||
258 | }; | ||
259 | |||
260 | static struct config_item_type simple_child_type = { | ||
261 | .ct_item_ops = &simple_child_item_ops, | ||
262 | .ct_attrs = simple_child_attrs, | ||
263 | .ct_owner = THIS_MODULE, | ||
264 | }; | ||
265 | |||
266 | |||
267 | static struct config_item *simple_children_make_item(struct config_group *group, const char *name) | ||
268 | { | ||
269 | struct simple_child *simple_child; | ||
270 | |||
271 | simple_child = kmalloc(sizeof(struct simple_child), GFP_KERNEL); | ||
272 | if (!simple_child) | ||
273 | return NULL; | ||
274 | |||
275 | memset(simple_child, 0, sizeof(struct simple_child)); | ||
276 | |||
277 | config_item_init_type_name(&simple_child->item, name, | ||
278 | &simple_child_type); | ||
279 | |||
280 | simple_child->storeme = 0; | ||
281 | |||
282 | return &simple_child->item; | ||
283 | } | ||
284 | |||
285 | static struct configfs_attribute simple_children_attr_description = { | ||
286 | .ca_owner = THIS_MODULE, | ||
287 | .ca_name = "description", | ||
288 | .ca_mode = S_IRUGO, | ||
289 | }; | ||
290 | |||
291 | static struct configfs_attribute *simple_children_attrs[] = { | ||
292 | &simple_children_attr_description, | ||
293 | NULL, | ||
294 | }; | ||
295 | |||
296 | static ssize_t simple_children_attr_show(struct config_item *item, | ||
297 | struct configfs_attribute *attr, | ||
298 | char *page) | ||
299 | { | ||
300 | return sprintf(page, | ||
301 | "[02-simple-children]\n" | ||
302 | "\n" | ||
303 | "This subsystem allows the creation of child config_items. These\n" | ||
304 | "items have only one attribute that is readable and writeable.\n"); | ||
305 | } | ||
306 | |||
307 | static struct configfs_item_operations simple_children_item_ops = { | ||
308 | .show_attribute = simple_children_attr_show, | ||
309 | }; | ||
310 | |||
311 | /* | ||
312 | * Note that, since no extra work is required on ->drop_item(), | ||
313 | * no ->drop_item() is provided. | ||
314 | */ | ||
315 | static struct configfs_group_operations simple_children_group_ops = { | ||
316 | .make_item = simple_children_make_item, | ||
317 | }; | ||
318 | |||
319 | static struct config_item_type simple_children_type = { | ||
320 | .ct_item_ops = &simple_children_item_ops, | ||
321 | .ct_group_ops = &simple_children_group_ops, | ||
322 | .ct_attrs = simple_children_attrs, | ||
323 | }; | ||
324 | |||
325 | static struct configfs_subsystem simple_children_subsys = { | ||
326 | .su_group = { | ||
327 | .cg_item = { | ||
328 | .ci_namebuf = "02-simple-children", | ||
329 | .ci_type = &simple_children_type, | ||
330 | }, | ||
331 | }, | ||
332 | }; | ||
333 | |||
334 | |||
335 | /* ----------------------------------------------------------------- */ | ||
336 | |||
337 | /* | ||
338 | * 03-group-children | ||
339 | * | ||
340 | * This example reuses the simple_children group from above. However, | ||
341 | * the simple_children group is not the subsystem itself, it is a | ||
342 | * child of the subsystem. Creation of a group in the subsystem creates | ||
343 | * a new simple_children group. That group can then have simple_child | ||
344 | * children of its own. | ||
345 | */ | ||
346 | |||
347 | struct simple_children { | ||
348 | struct config_group group; | ||
349 | }; | ||
350 | |||
351 | static struct config_group *group_children_make_group(struct config_group *group, const char *name) | ||
352 | { | ||
353 | struct simple_children *simple_children; | ||
354 | |||
355 | simple_children = kmalloc(sizeof(struct simple_children), | ||
356 | GFP_KERNEL); | ||
357 | if (!simple_children) | ||
358 | return NULL; | ||
359 | |||
360 | memset(simple_children, 0, sizeof(struct simple_children)); | ||
361 | |||
362 | config_group_init_type_name(&simple_children->group, name, | ||
363 | &simple_children_type); | ||
364 | |||
365 | return &simple_children->group; | ||
366 | } | ||
367 | |||
368 | static struct configfs_attribute group_children_attr_description = { | ||
369 | .ca_owner = THIS_MODULE, | ||
370 | .ca_name = "description", | ||
371 | .ca_mode = S_IRUGO, | ||
372 | }; | ||
373 | |||
374 | static struct configfs_attribute *group_children_attrs[] = { | ||
375 | &group_children_attr_description, | ||
376 | NULL, | ||
377 | }; | ||
378 | |||
379 | static ssize_t group_children_attr_show(struct config_item *item, | ||
380 | struct configfs_attribute *attr, | ||
381 | char *page) | ||
382 | { | ||
383 | return sprintf(page, | ||
384 | "[03-group-children]\n" | ||
385 | "\n" | ||
386 | "This subsystem allows the creation of child config_groups. These\n" | ||
387 | "groups are like the subsystem simple-children.\n"); | ||
388 | } | ||
389 | |||
390 | static struct configfs_item_operations group_children_item_ops = { | ||
391 | .show_attribute = group_children_attr_show, | ||
392 | }; | ||
393 | |||
394 | /* | ||
395 | * Note that, since no extra work is required on ->drop_item(), | ||
396 | * no ->drop_item() is provided. | ||
397 | */ | ||
398 | static struct configfs_group_operations group_children_group_ops = { | ||
399 | .make_group = group_children_make_group, | ||
400 | }; | ||
401 | |||
402 | static struct config_item_type group_children_type = { | ||
403 | .ct_item_ops = &group_children_item_ops, | ||
404 | .ct_group_ops = &group_children_group_ops, | ||
405 | .ct_attrs = group_children_attrs, | ||
406 | }; | ||
407 | |||
408 | static struct configfs_subsystem group_children_subsys = { | ||
409 | .su_group = { | ||
410 | .cg_item = { | ||
411 | .ci_namebuf = "03-group-children", | ||
412 | .ci_type = &group_children_type, | ||
413 | }, | ||
414 | }, | ||
415 | }; | ||
416 | |||
417 | /* ----------------------------------------------------------------- */ | ||
418 | |||
419 | /* | ||
420 | * We're now done with our subsystem definitions. | ||
421 | * For convenience in this module, here's a list of them all. It | ||
422 | * allows the init function to easily register them. Most modules | ||
423 | * will only have one subsystem, and will only call register_subsystem | ||
424 | * on it directly. | ||
425 | */ | ||
426 | static struct configfs_subsystem *example_subsys[] = { | ||
427 | &childless_subsys.subsys, | ||
428 | &simple_children_subsys, | ||
429 | &group_children_subsys, | ||
430 | NULL, | ||
431 | }; | ||
432 | |||
433 | static int __init configfs_example_init(void) | ||
434 | { | ||
435 | int ret; | ||
436 | int i; | ||
437 | struct configfs_subsystem *subsys; | ||
438 | |||
439 | for (i = 0; example_subsys[i]; i++) { | ||
440 | subsys = example_subsys[i]; | ||
441 | |||
442 | config_group_init(&subsys->su_group); | ||
443 | init_MUTEX(&subsys->su_sem); | ||
444 | ret = configfs_register_subsystem(subsys); | ||
445 | if (ret) { | ||
446 | printk(KERN_ERR "Error %d while registering subsystem %s\n", | ||
447 | ret, | ||
448 | subsys->su_group.cg_item.ci_namebuf); | ||
449 | goto out_unregister; | ||
450 | } | ||
451 | } | ||
452 | |||
453 | return 0; | ||
454 | |||
455 | out_unregister: | ||
456 | for (i--; i >= 0; i--) { | ||
457 | configfs_unregister_subsystem(example_subsys[i]); | ||
458 | } | ||
459 | |||
460 | return ret; | ||
461 | } | ||
462 | |||
463 | static void __exit configfs_example_exit(void) | ||
464 | { | ||
465 | int i; | ||
466 | |||
467 | for (i = 0; example_subsys[i]; i++) { | ||
468 | configfs_unregister_subsystem(example_subsys[i]); | ||
469 | } | ||
470 | } | ||
471 | |||
472 | module_init(configfs_example_init); | ||
473 | module_exit(configfs_example_exit); | ||
474 | MODULE_LICENSE("GPL"); | ||
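The store handlers in the example above (childless_storeme_write and simple_child_attr_store) both validate their input the same way: parse a decimal number, reject trailing garbage other than a newline, and reject values above INT_MAX. A minimal userspace sketch of that validation follows. It is not part of the patch: the name parse_storeme() is hypothetical, and it uses the C library's strtoul() in place of the kernel's simple_strtoul(), so details such as leading whitespace and sign handling may differ from the kernel helper.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of the example's store handlers.
 * Returns 0 and fills *out on success, or a negative errno value,
 * mirroring the -EINVAL / -ERANGE checks in the kernel code.
 */
int parse_storeme(const char *page, int *out)
{
	char *end;
	unsigned long tmp;

	assert(page != NULL && out != NULL);

	errno = 0;
	tmp = strtoul(page, &end, 10);

	/* Parsing must stop at the end of the string or at a newline. */
	if (*end && *end != '\n')
		return -EINVAL;

	/* The attribute is stored in an int, so larger values are refused. */
	if (errno == ERANGE || tmp > INT_MAX)
		return -ERANGE;

	*out = (int)tmp;
	return 0;
}
```

Writing "42\n" (what echo produces) is accepted, while "42x" is refused as invalid and 2147483648 is refused as out of range.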
diff --git a/Documentation/filesystems/dlmfs.txt b/Documentation/filesystems/dlmfs.txt new file mode 100644 index 000000000000..9afab845a906 --- /dev/null +++ b/Documentation/filesystems/dlmfs.txt | |||
@@ -0,0 +1,130 @@ | |||
1 | dlmfs | ||
2 | ================== | ||
3 | A minimal DLM userspace interface implemented via a virtual file | ||
4 | system. | ||
5 | |||
6 | dlmfs is built with OCFS2 as it requires most of its infrastructure. | ||
7 | |||
8 | Project web page: http://oss.oracle.com/projects/ocfs2 | ||
9 | Tools web page: http://oss.oracle.com/projects/ocfs2-tools | ||
10 | OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ | ||
11 | |||
12 | All code copyright 2005 Oracle except when otherwise noted. | ||
13 | |||
14 | CREDITS | ||
15 | ======= | ||
16 | |||
17 | Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds | ||
18 | and Transmeta Corp. | ||
19 | |||
20 | Mark Fasheh <mark.fasheh@oracle.com> | ||
21 | |||
22 | Caveats | ||
23 | ======= | ||
24 | - Right now it only works with the OCFS2 DLM, though support for other | ||
25 | DLM implementations should not be a major issue. | ||
26 | |||
27 | Mount options | ||
28 | ============= | ||
29 | None | ||
30 | |||
31 | Usage | ||
32 | ===== | ||
33 | |||
34 | If you're just interested in OCFS2, then please see ocfs2.txt. The | ||
35 | rest of this document will be geared towards those who want to use | ||
36 | dlmfs for easy to set up and easy to use clustered locking in | ||
37 | userspace. | ||
38 | |||
39 | Setup | ||
40 | ===== | ||
41 | |||
42 | dlmfs requires that the OCFS2 cluster infrastructure be in | ||
43 | place. Please download ocfs2-tools from the above url and configure a | ||
44 | cluster. | ||
45 | |||
46 | You'll want to start heartbeating on a volume which all the nodes in | ||
47 | your lockspace can access. The easiest way to do this is via | ||
48 | ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires | ||
49 | that an OCFS2 file system be in place so that it can automatically | ||
50 | find its heartbeat area, though it will eventually support heartbeat | ||
51 | against raw disks. | ||
52 | |||
53 | Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed | ||
54 | with ocfs2-tools. | ||
55 | |||
56 | Once you're heartbeating, DLM lock 'domains' can be easily created / | ||
57 | destroyed and locks within them accessed. | ||
58 | |||
59 | Locking | ||
60 | ======= | ||
61 | |||
62 | Users may access dlmfs via standard file system calls, or they can use | ||
63 | 'libo2dlm' (distributed with ocfs2-tools) which abstracts the file | ||
64 | system calls and presents a more traditional locking api. | ||
65 | |||
66 | dlmfs handles lock caching automatically for the user, so a lock | ||
67 | request for an already acquired lock will not generate another DLM | ||
68 | call. Userspace programs are assumed to handle their own local | ||
69 | locking. | ||
70 | |||
71 | Two levels of locks are supported - Shared Read and Exclusive. | ||
72 | Also supported is a Trylock operation. | ||
73 | |||
74 | For information on the libo2dlm interface, please see o2dlm.h, | ||
75 | distributed with ocfs2-tools. | ||
76 | |||
77 | Lock value blocks can be read and written to a resource via read(2) | ||
78 | and write(2) against the fd obtained via your open(2) call. The | ||
79 | maximum currently supported LVB length is 64 bytes (though that is an | ||
80 | OCFS2 DLM limitation). Through this mechanism, users of dlmfs can share | ||
81 | small amounts of data amongst their nodes. | ||
82 | |||
83 | mkdir(2) signals dlmfs to join a domain (which will have the same name | ||
84 | as the resulting directory) | ||
85 | |||
86 | rmdir(2) signals dlmfs to leave the domain | ||
87 | |||
88 | Locks for a given domain are represented by regular inodes inside the | ||
89 | domain directory. Locking against them is done via the open(2) system | ||
90 | call. | ||
91 | |||
92 | The open(2) call will not return until your lock has been granted or | ||
93 | an error has occurred, unless it has been instructed to do a trylock | ||
94 | operation. If the lock succeeds, you'll get an fd. | ||
95 | |||
96 | Call open(2) with O_CREAT to ensure the resource inode is created - dlmfs does | ||
97 | not automatically create inodes for existing lock resources. | ||
98 | |||
99 | Open Flag Lock Request Type | ||
100 | --------- ----------------- | ||
101 | O_RDONLY Shared Read | ||
102 | O_RDWR Exclusive | ||
103 | |||
104 | Open Flag Resulting Locking Behavior | ||
105 | --------- -------------------------- | ||
106 | O_NONBLOCK Trylock operation | ||
107 | |||
108 | You must provide exactly one of O_RDONLY or O_RDWR. | ||
109 | |||
110 | If O_NONBLOCK is also provided and the trylock operation was valid but | ||
111 | could not lock the resource then open(2) will fail with ETXTBSY. | ||
112 | |||
113 | close(2) drops the lock associated with your fd. | ||
114 | |||
115 | Modes passed to mkdir(2) or open(2) are adhered to locally. Chown is | ||
116 | supported locally as well. This means you can use them to restrict | ||
117 | access to the resources via dlmfs on your local node only. | ||
118 | |||
119 | The resource LVB may be read from the fd in either Shared Read or | ||
120 | Exclusive modes via the read(2) system call. It can be written via | ||
121 | write(2) only when open in Exclusive mode. | ||
122 | |||
123 | Once written, an LVB will be visible to other nodes who obtain Read | ||
124 | Only or higher level locks on the resource. | ||
125 | |||
126 | See Also | ||
127 | ======== | ||
128 | http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf | ||
129 | |||
130 | For more information on the VMS distributed locking API. | ||
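The open-flag tables in dlmfs.txt above can be sketched as a small C helper. This is only an illustration under stated assumptions, not code from ocfs2-tools: the enum and the functions dlmfs_open_flags() and dlmfs_lock() are hypothetical names, and the lock path passed in (for example a file inside a domain directory under wherever dlmfs is mounted) is up to the caller.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical names for the two lock levels described in the text. */
enum dlmfs_level { DLMFS_SHARED_READ, DLMFS_EXCLUSIVE };

/*
 * Build the open(2) flags for a lock request:
 *   O_RDONLY   -> Shared Read
 *   O_RDWR     -> Exclusive
 *   O_NONBLOCK -> trylock
 * O_CREAT is always set so that the resource inode exists locally.
 */
int dlmfs_open_flags(enum dlmfs_level level, int trylock)
{
	int flags = O_CREAT;

	flags |= (level == DLMFS_EXCLUSIVE) ? O_RDWR : O_RDONLY;
	if (trylock)
		flags |= O_NONBLOCK;
	return flags;
}

/*
 * Take the lock named by 'path'. Returns an fd on success or -1 with
 * errno set; close(2) on the returned fd drops the lock.
 */
int dlmfs_lock(const char *path, enum dlmfs_level level, int trylock)
{
	assert(path != NULL);
	return open(path, dlmfs_open_flags(level, trylock), 0600);
}
```

A caller would mkdir(2) the domain directory first, call dlmfs_lock() on a file inside it, use read(2) or write(2) on the fd to access the LVB, and close(2) the fd to release the lock.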
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 9840d5b8d5b9..afb1335c05d6 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt | |||
@@ -2,11 +2,11 @@ | |||
2 | Ext3 Filesystem | 2 | Ext3 Filesystem |
3 | =============== | 3 | =============== |
4 | 4 | ||
5 | ext3 was originally released in September 1999. Written by Stephen Tweedie | 5 | Ext3 was originally released in September 1999. Written by Stephen Tweedie |
6 | for 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger, | 6 | for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger, |
7 | Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie. | 7 | Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie. |
8 | 8 | ||
9 | ext3 is ext2 filesystem enhanced with journalling capabilities. | 9 | Ext3 is the ext2 filesystem enhanced with journalling capabilities. |
10 | 10 | ||
11 | Options | 11 | Options |
12 | ======= | 12 | ======= |
@@ -14,76 +14,81 @@ Options | |||
14 | When mounting an ext3 filesystem, the following options are accepted: | 14 | When mounting an ext3 filesystem, the following options are accepted: |
15 | (*) == default | 15 | (*) == default |
16 | 16 | ||
17 | jounal=update Update the ext3 file system's journal to the | 17 | journal=update Update the ext3 file system's journal to the current |
18 | current format. | 18 | format. |
19 | 19 | ||
20 | journal=inum When a journal already exists, this option is | 20 | journal=inum When a journal already exists, this option is ignored. |
21 | ignored. Otherwise, it specifies the number of | 21 | Otherwise, it specifies the number of the inode which |
22 | the inode which will represent the ext3 file | 22 | will represent the ext3 file system's journal file. |
23 | system's journal file. | 23 | |
24 | journal_dev=devnum When the external journal device's major/minor numbers | ||
25 | have changed, this option allows the user to specify | ||
26 | the new journal location. The journal device is | ||
27 | identified through its new major/minor numbers encoded | ||
28 | in devnum. | ||
24 | 29 | ||
25 | noload Don't load the journal on mounting. | 30 | noload Don't load the journal on mounting. |
26 | 31 | ||
27 | data=journal All data are committed into the journal prior | 32 | data=journal All data are committed into the journal prior to being |
28 | to being written into the main file system. | 33 | written into the main file system. |
29 | 34 | ||
30 | data=ordered (*) All data are forced directly out to the main file | 35 | data=ordered (*) All data are forced directly out to the main file |
31 | system prior to its metadata being committed to | 36 | system prior to its metadata being committed to the |
32 | the journal. | 37 | journal. |
33 | 38 | ||
34 | data=writeback Data ordering is not preserved, data may be | 39 | data=writeback Data ordering is not preserved, data may be written |
35 | written into the main file system after its | 40 | into the main file system after its metadata has been |
36 | metadata has been committed to the journal. | 41 | committed to the journal. |
37 | 42 | ||
38 | commit=nrsec (*) Ext3 can be told to sync all its data and metadata | 43 | commit=nrsec (*) Ext3 can be told to sync all its data and metadata |
39 | every 'nrsec' seconds. The default value is 5 seconds. | 44 | every 'nrsec' seconds. The default value is 5 seconds. |
40 | This means that if you lose your power, you will lose, | 45 | This means that if you lose your power, you will lose |
41 | as much, the latest 5 seconds of work (your filesystem | 46 | as much as the latest 5 seconds of work (your |
42 | will not be damaged though, thanks to journaling). This | 47 | filesystem will not be damaged though, thanks to the |
43 | default value (or any low value) will hurt performance, | 48 | journaling). This default value (or any low value) |
44 | but it's good for data-safety. Setting it to 0 will | 49 | will hurt performance, but it's good for data-safety. |
45 | have the same effect than leaving the default 5 sec. | 50 | Setting it to 0 will have the same effect as leaving |
51 | it at the default (5 seconds). | ||
46 | Setting it to very large values will improve | 52 | Setting it to very large values will improve |
47 | performance. | 53 | performance. |
48 | 54 | ||
49 | barrier=1 This enables/disables barriers. barrier=0 disables it, | 55 | barrier=1 This enables/disables barriers. barrier=0 disables |
50 | barrier=1 enables it. | 56 | it, barrier=1 enables it. |
51 | 57 | ||
52 | orlov (*) This enables the new Orlov block allocator. It's enabled | 58 | orlov (*) This enables the new Orlov block allocator. It is |
53 | by default. | 59 | enabled by default. |
54 | 60 | ||
55 | oldalloc This disables the Orlov block allocator and enables the | 61 | oldalloc This disables the Orlov block allocator and enables |
56 | old block allocator. Orlov should have better performance, | 62 | the old block allocator. Orlov should have better |
57 | we'd like to get some feedback if it's the contrary for | 63 | performance - we'd like to get some feedback if it's |
58 | you. | 64 | the contrary for you. |
59 | 65 | ||
60 | user_xattr Enables Extended User Attributes. Additionally, you need | 66 | user_xattr Enables Extended User Attributes. Additionally, you |
61 | to have extended attribute support enabled in the kernel | 67 | need to have extended attribute support enabled in the |
62 | configuration (CONFIG_EXT3_FS_XATTR). See the attr(5) | 68 | kernel configuration (CONFIG_EXT3_FS_XATTR). See the |
63 | manual page and http://acl.bestbits.at to learn more | 69 | attr(5) manual page and http://acl.bestbits.at/ to |
64 | about extended attributes. | 70 | learn more about extended attributes. |
65 | 71 | ||
66 | nouser_xattr Disables Extended User Attributes. | 72 | nouser_xattr Disables Extended User Attributes. |
67 | 73 | ||
68 | acl Enables POSIX Access Control Lists support. Additionally, | 74 | acl Enables POSIX Access Control Lists support. |
69 | you need to have ACL support enabled in the kernel | 75 | Additionally, you need to have ACL support enabled in |
70 | configuration (CONFIG_EXT3_FS_POSIX_ACL). See the acl(5) | 76 | the kernel configuration (CONFIG_EXT3_FS_POSIX_ACL). |
71 | manual page and http://acl.bestbits.at for more | 77 | See the acl(5) manual page and http://acl.bestbits.at/ |
72 | information. | 78 | for more information. |
73 | 79 | ||
74 | noacl This option disables POSIX Access Control List support. | 80 | noacl This option disables POSIX Access Control List |
81 | support. | ||
75 | 82 | ||
76 | reservation | 83 | reservation |
77 | 84 | ||
78 | noreservation | 85 | noreservation |
79 | 86 | ||
80 | resize= | ||
81 | |||
82 | bsddf (*) Make 'df' act like BSD. | 87 | bsddf (*) Make 'df' act like BSD. |
83 | minixdf Make 'df' act like Minix. | 88 | minixdf Make 'df' act like Minix. |
84 | 89 | ||
85 | check=none Don't do extra checking of bitmaps on mount. | 90 | check=none Don't do extra checking of bitmaps on mount. |
86 | nocheck | 91 | nocheck |
87 | 92 | ||
88 | debug Extra debugging information is sent to syslog. | 93 | debug Extra debugging information is sent to syslog. |
89 | 94 | ||
@@ -92,7 +97,7 @@ errors=continue Keep going on a filesystem error. | |||
92 | errors=panic Panic and halt the machine if an error occurs. | 97 | errors=panic Panic and halt the machine if an error occurs. |
93 | 98 | ||
94 | grpid Give objects the same group ID as their creator. | 99 | grpid Give objects the same group ID as their creator. |
95 | bsdgroups | 100 | bsdgroups |
96 | 101 | ||
97 | nogrpid (*) New objects have the group ID of their creator. | 102 | nogrpid (*) New objects have the group ID of their creator. |
98 | sysvgroups | 103 | sysvgroups |
@@ -103,81 +108,83 @@ resuid=n The user ID which may use the reserved blocks. | |||
103 | 108 | ||
104 | sb=n Use alternate superblock at this location. | 109 | sb=n Use alternate superblock at this location. |
105 | 110 | ||
106 | quota Quota options are currently silently ignored. | 111 | quota |
107 | noquota (see fs/ext3/super.c, line 594) | 112 | noquota |
108 | grpquota | 113 | grpquota |
109 | usrquota | 114 | usrquota |
110 | 115 | ||
111 | 116 | ||
112 | Specification | 117 | Specification |
113 | ============= | 118 | ============= |
114 | ext3 shares all disk implementation with ext2 filesystem, and add | 119 | Ext3 shares all disk implementation with the ext2 filesystem, and adds |
115 | transactions capabilities to ext2. Journaling is done by the | 120 | transactions capabilities to ext2. Journaling is done by the Journaling Block |
116 | Journaling block device layer. | 121 | Device layer. |
117 | 122 | ||
118 | Journaling Block Device layer | 123 | Journaling Block Device layer |
119 | ----------------------------- | 124 | ----------------------------- |
120 | The Journaling Block Device layer (JBD) isn't ext3 specific. It was | 125 | The Journaling Block Device layer (JBD) isn't ext3 specific. It was designed to |
121 | design to add journaling capabilities on a block device. The ext3 | 126 | add journaling capabilities on a block device. The ext3 filesystem code will |
122 | filesystem code will inform the JBD of modifications it is performing | 127 | inform the JBD of modifications it is performing (called a transaction). The |
123 | (Call a transaction). the journal support the transactions start and | 128 | journal supports the transactions start and stop, and in case of crash, the |
124 | stop, and in case of crash, the journal can replayed the transactions | 129 | journal can replay the transactions to put the partition back in a |
125 | to put the partition on a consistent state fastly. | 130 | consistent state fast. |
126 | 131 | ||
127 | handles represent a single atomic update to a filesystem. JBD can | 132 | Handles represent a single atomic update to a filesystem. JBD can handle an |
128 | handle external journal on a block device. | 133 | external journal on a block device. |
129 | 134 | ||
130 | Data Mode | 135 | Data Mode |
131 | --------- | 136 | --------- |
132 | There's 3 different data modes: | 137 | There are 3 different data modes: |
133 | 138 | ||
134 | * writeback mode | 139 | * writeback mode |
135 | In data=writeback mode, ext3 does not journal data at all. This mode | 140 | In data=writeback mode, ext3 does not journal data at all. This mode provides |
136 | provides a similar level of journaling as XFS, JFS, and ReiserFS in its | 141 | a similar level of journaling as that of XFS, JFS, and ReiserFS in its default |
137 | default mode - metadata journaling. A crash+recovery can cause | 142 | mode - metadata journaling. A crash+recovery can cause incorrect data to |
138 | incorrect data to appear in files which were written shortly before the | 143 | appear in files which were written shortly before the crash. This mode will |
139 | crash. This mode will typically provide the best ext3 performance. | 144 | typically provide the best ext3 performance. |
140 | 145 | ||
141 | * ordered mode | 146 | * ordered mode |
142 | In data=ordered mode, ext3 only officially journals metadata, but it | 147 | In data=ordered mode, ext3 only officially journals metadata, but it logically |
143 | logically groups metadata and data blocks into a single unit called a | 148 | groups metadata and data blocks into a single unit called a transaction. When |
144 | transaction. When it's time to write the new metadata out to disk, the | 149 | it's time to write the new metadata out to disk, the associated data blocks |
145 | associated data blocks are written first. In general, this mode | 150 | are written first. In general, this mode performs slightly slower than |
146 | perform slightly slower than writeback but significantly faster than | 151 | writeback but significantly faster than journal mode. |
147 | journal mode. | ||
148 | 152 | ||
149 | * journal mode | 153 | * journal mode |
150 | data=journal mode provides full data and metadata journaling. All new | 154 | data=journal mode provides full data and metadata journaling. All new data is |
151 | data is written to the journal first, and then to its final location. | 155 | written to the journal first, and then to its final location. |
152 | In the event of a crash, the journal can be replayed, bringing both | 156 | In the event of a crash, the journal can be replayed, bringing both data and |
153 | data and metadata into a consistent state. This mode is the slowest | 157 | metadata into a consistent state. This mode is the slowest except when data |
154 | except when data needs to be read from and written to disk at the same | 158 | needs to be read from and written to disk at the same time where it |
155 | time where it outperform all others mode. | 159 | outperforms all other modes. |
156 | 160 | ||
157 | Compatibility | 161 | Compatibility |
158 | ------------- | 162 | ------------- |
159 | 163 | ||
160 | Ext2 partitions can be easily converted to ext3, with `tune2fs -j <dev>`. | 164 | Ext2 partitions can be easily converted to ext3, with `tune2fs -j <dev>`. |
161 | Ext3 is fully compatible with Ext2. Ext3 partitions can easily be | 165 | Ext3 is fully compatible with Ext2. Ext3 partitions can easily be mounted as |
162 | mounted as Ext2. | 166 | Ext2. |
167 | |||
163 | 168 | ||
164 | External Tools | 169 | External Tools |
165 | ============== | 170 | ============== |
166 | see manual pages to know more. | 171 | See manual pages to learn more. |
172 | |||
173 | tune2fs: create a ext3 journal on a ext2 partition with the -j flag. | ||
174 | mke2fs: create a ext3 partition with the -j flag. | ||
175 | debugfs: ext2 and ext3 file system debugger. | ||
176 | ext2online: online (mounted) ext2 and ext3 filesystem resizer | ||
167 | 177 | ||
168 | tune2fs: create a ext3 journal on a ext2 partition with the -j flags | ||
169 | mke2fs: create a ext3 partition with the -j flags | ||
170 | debugfs: ext2 and ext3 file system debugger | ||
171 | 178 | ||
172 | References | 179 | References |
173 | ========== | 180 | ========== |
174 | 181 | ||
175 | kernel source: file:/usr/src/linux/fs/ext3 | 182 | kernel source: <file:fs/ext3/> |
176 | file:/usr/src/linux/fs/jbd | 183 | <file:fs/jbd/> |
177 | 184 | ||
178 | programs: http://e2fsprogs.sourceforge.net | 185 | programs: http://e2fsprogs.sourceforge.net/ |
186 | http://ext2resize.sourceforge.net | ||
179 | 187 | ||
180 | useful link: | 188 | useful links: http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html |
181 | http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html | ||
182 | http://www-106.ibm.com/developerworks/linux/library/l-fs7/ | 189 | http://www-106.ibm.com/developerworks/linux/library/l-fs7/ |
183 | http://www-106.ibm.com/developerworks/linux/library/l-fs8/ | 190 | http://www-106.ibm.com/developerworks/linux/library/l-fs8/ |
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt index 6b5741e651a2..33f74310d161 100644 --- a/Documentation/filesystems/fuse.txt +++ b/Documentation/filesystems/fuse.txt | |||
@@ -86,6 +86,62 @@ Mount options | |||
86 | The default is infinite. Note that the size of read requests is | 86 | The default is infinite. Note that the size of read requests is |
87 | limited anyway to 32 pages (which is 128kbyte on i386). | 87 | limited anyway to 32 pages (which is 128kbyte on i386). |
88 | 88 | ||
89 | Sysfs | ||
90 | ~~~~~ | ||
91 | |||
92 | FUSE sets up the following hierarchy in sysfs: | ||
93 | |||
94 | /sys/fs/fuse/connections/N/ | ||
95 | |||
96 | where N is an increasing number allocated to each new connection. | ||
97 | |||
98 | For each connection the following attributes are defined: | ||
99 | |||
100 | 'waiting' | ||
101 | |||
102 | The number of requests which are waiting to be transferred to | ||
103 | userspace or being processed by the filesystem daemon. If there is | ||
104 | no filesystem activity and 'waiting' is non-zero, then the | ||
105 | filesystem is hung or deadlocked. | ||
106 | |||
107 | 'abort' | ||
108 | |||
109 | Writing anything into this file will abort the filesystem | ||
110 | connection. This means that all waiting requests will be aborted and | ||
111 | an error returned for all aborted and new requests. | ||
112 | |||
113 | Only a privileged user may read or write these attributes. | ||
114 | |||
115 | Aborting a filesystem connection | ||
116 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
117 | |||
118 | It is possible to get into certain situations where the filesystem is | ||
119 | not responding. Reasons for this may be: | ||
120 | |||
121 | a) Broken userspace filesystem implementation | ||
122 | |||
123 | b) Network connection down | ||
124 | |||
125 | c) Accidental deadlock | ||
126 | |||
127 | d) Malicious deadlock | ||
128 | |||
129 | (For more on c) and d) see later sections) | ||
130 | |||
131 | In any of these cases it may be useful to abort the connection to | ||
132 | the filesystem. There are several ways to do this: | ||
133 | |||
134 | - Kill the filesystem daemon. Works in case of a) and b) | ||
135 | |||
136 | - Kill the filesystem daemon and all users of the filesystem. Works | ||
137 | in all cases except some malicious deadlocks | ||
138 | |||
139 | - Use forced umount (umount -f). Works in all cases but only if | ||
140 | filesystem is still attached (it hasn't been lazy unmounted) | ||
141 | |||
142 | - Abort filesystem through the sysfs interface. Most powerful | ||
143 | method, always works. | ||
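The sysfs method above can be sketched in C roughly as follows. This is a minimal illustration, not code from the FUSE sources: the helper name fuse_abort_connection() is hypothetical, and the caller is assumed to pass the full path of the 'abort' attribute of the connection in question (under /sys/fs/fuse/connections/N/ for a known N). Because writing anything aborts the connection, a single byte is written.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write one byte to the 'abort' attribute found at
 * abort_path.  Returns 0 on success or a negative errno value.
 */
int fuse_abort_connection(const char *abort_path)
{
	ssize_t n;
	int fd, err = 0;

	fd = open(abort_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* The content is irrelevant; any write triggers the abort. */
	n = write(fd, "1", 1);
	if (n < 0)
		err = -errno;
	else if (n != 1)
		err = -EIO;

	close(fd);
	return err;
}
```

Since only a privileged user may write the attribute, an unprivileged caller should expect a negative return value from the open() step.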
144 | |||
89 | How do non-privileged mounts work? | 145 | How do non-privileged mounts work? |
90 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 146 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
91 | 147 | ||
@@ -313,3 +369,10 @@ faulted with get_user_pages(). The 'req->locked' flag indicates | |||
313 | when the copy is taking place, and interruption is delayed until | 369 | when the copy is taking place, and interruption is delayed until |
314 | this flag is unset. | 370 | this flag is unset. |
315 | 371 | ||
372 | Scenario 3 - Tricky deadlock with asynchronous read | ||
373 | --------------------------------------------------- | ||
374 | |||
375 | The same situation as above, except thread-1 will wait on page lock | ||
376 | and hence it will be uninterruptible as well. The solution is to | ||
377 | abort the connection with forced umount (if mount is attached) or | ||
378 | through the abort attribute in sysfs. | ||
diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt new file mode 100644 index 000000000000..f2595caf052e --- /dev/null +++ b/Documentation/filesystems/ocfs2.txt | |||
@@ -0,0 +1,55 @@ | |||
1 | OCFS2 filesystem | ||
2 | ================== | ||
3 | OCFS2 is a general purpose extent based shared disk cluster file | ||
4 | system with many similarities to ext3. It supports 64 bit inode | ||
5 | numbers, and has automatically extending metadata groups which may | ||
6 | also make it attractive for non-clustered use. | ||
7 | |||
8 | You'll want to install the ocfs2-tools package in order to at least | ||
9 | get "mount.ocfs2" and "ocfs2_hb_ctl". | ||
10 | |||
11 | Project web page: http://oss.oracle.com/projects/ocfs2 | ||
12 | Tools web page: http://oss.oracle.com/projects/ocfs2-tools | ||
13 | OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ | ||
14 | |||
15 | All code copyright 2005 Oracle except when otherwise noted. | ||
16 | |||
17 | CREDITS: | ||
18 | Lots of code taken from ext3 and other projects. | ||
19 | |||
20 | Authors in alphabetical order: | ||
21 | Joel Becker <joel.becker@oracle.com> | ||
22 | Zach Brown <zach.brown@oracle.com> | ||
23 | Mark Fasheh <mark.fasheh@oracle.com> | ||
24 | Kurt Hackel <kurt.hackel@oracle.com> | ||
25 | Sunil Mushran <sunil.mushran@oracle.com> | ||
26 | Manish Singh <manish.singh@oracle.com> | ||
27 | |||
28 | Caveats | ||
29 | ======= | ||
30 | Features which OCFS2 does not support yet: | ||
31 | - sparse files | ||
32 | - extended attributes | ||
33 | - shared writeable mmap | ||
34 | - loopback is supported, but data written will not | ||
35 | be cluster coherent. | ||
36 | - quotas | ||
37 | - cluster aware flock | ||
38 | - Directory change notification (F_NOTIFY) | ||
39 | - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease) | ||
40 | - POSIX ACLs | ||
41 | - readpages / writepages (not user visible) | ||
42 | |||
43 | Mount options | ||
44 | ============= | ||
45 | |||
46 | OCFS2 supports the following mount options: | ||
47 | (*) == default | ||
48 | |||
49 | barrier=1 This enables/disables barriers. barrier=0 disables it, | ||
50 | barrier=1 enables it. | ||
51 | errors=remount-ro(*) Remount the filesystem read-only on an error. | ||
52 | errors=panic Panic and halt the machine if an error occurs. | ||
53 | intr (*) Allow signals to interrupt cluster operations. | ||
54 | nointr Do not allow signals to interrupt cluster | ||
55 | operations. | ||
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index d4773565ea2f..944cf109a6f5 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt | |||
@@ -418,7 +418,7 @@ VmallocChunk: 111088 kB | |||
418 | Dirty: Memory which is waiting to get written back to the disk | 418 | Dirty: Memory which is waiting to get written back to the disk |
419 | Writeback: Memory which is actively being written back to the disk | 419 | Writeback: Memory which is actively being written back to the disk |
420 | Mapped: files which have been mmaped, such as libraries | 420 | Mapped: files which have been mmaped, such as libraries |
421 | Slab: in-kernel data structures cache | 421 | Slab: in-kernel data structures cache |
422 | CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), | 422 | CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), |
423 | this is the total amount of memory currently available to | 423 | this is the total amount of memory currently available to |
424 | be allocated on the system. This limit is only adhered to | 424 | be allocated on the system. This limit is only adhered to |
@@ -1302,6 +1302,23 @@ VM has token based thrashing control mechanism and uses the token to prevent | |||
1302 | unnecessary page faults in thrashing situation. The unit of the value is | 1302 | unnecessary page faults in thrashing situation. The unit of the value is |
1303 | second. The value would be useful to tune thrashing behavior. | 1303 | second. The value would be useful to tune thrashing behavior. |
1304 | 1304 | ||
1305 | drop_caches | ||
1306 | ----------- | ||
1307 | |||
1308 | Writing to this will cause the kernel to drop clean caches, dentries and | ||
1309 | inodes from memory, causing that memory to become free. | ||
1310 | |||
1311 | To free pagecache: | ||
1312 | echo 1 > /proc/sys/vm/drop_caches | ||
1313 | To free dentries and inodes: | ||
1314 | echo 2 > /proc/sys/vm/drop_caches | ||
1315 | To free pagecache, dentries and inodes: | ||
1316 | echo 3 > /proc/sys/vm/drop_caches | ||
1317 | |||
1318 | As this is a non-destructive operation and dirty objects are not freeable, the | ||
1319 | user should run `sync' first. | ||
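For programs that prefer not to shell out to echo, the same sequence (sync first, then write the mode) can be sketched in C. The helper name drop_caches() and its path parameter are assumptions made for illustration; on a real system the path would be /proc/sys/vm/drop_caches and the caller would need sufficient privilege.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper.  mode is 1 (pagecache), 2 (dentries and inodes)
 * or 3 (both), matching the echo examples.  Returns 0 on success or a
 * negative errno value.
 */
int drop_caches(const char *path, int mode)
{
	char buf[2];
	ssize_t n;
	int fd, err = 0;

	if (mode < 1 || mode > 3)
		return -EINVAL;

	/* Dirty objects are not freeable, so write them back first. */
	sync();

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* Write the digit followed by a newline, as echo would. */
	buf[0] = (char)('0' + mode);
	buf[1] = '\n';
	n = write(fd, buf, sizeof(buf));
	if (n < 0)
		err = -errno;
	else if (n != (ssize_t)sizeof(buf))
		err = -EIO;

	close(fd);
	return err;
}
```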
1320 | |||
1321 | |||
1305 | 2.5 /proc/sys/dev - Device specific parameters | 1322 | 2.5 /proc/sys/dev - Device specific parameters |
1306 | ---------------------------------------------- | 1323 | ---------------------------------------------- |
1307 | 1324 | ||
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt index b3404a032596..60ab61e54e8a 100644 --- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt +++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt | |||
@@ -143,12 +143,26 @@ as the following example: | |||
143 | dir /mnt 755 0 0 | 143 | dir /mnt 755 0 0 |
144 | file /init initramfs/init.sh 755 0 0 | 144 | file /init initramfs/init.sh 755 0 0 |
145 | 145 | ||
146 | Run "usr/gen_init_cpio" (after the kernel build) to get a usage message | ||
147 | documenting the above file format. | ||
148 | |||
146 | One advantage of the text file is that root access is not required to | 149 | One advantage of the text file is that root access is not required to |
147 | set permissions or create device nodes in the new archive. (Note that those | 150 | set permissions or create device nodes in the new archive. (Note that those |
148 | two example "file" entries expect to find files named "init.sh" and "busybox" in | 151 | two example "file" entries expect to find files named "init.sh" and "busybox" in |
149 | a directory called "initramfs", under the linux-2.6.* directory. See | 152 | a directory called "initramfs", under the linux-2.6.* directory. See |
150 | Documentation/early-userspace/README for more details.) | 153 | Documentation/early-userspace/README for more details.) |
151 | 154 | ||
155 | The kernel does not depend on external cpio tools; gen_init_cpio is created | ||
156 | from usr/gen_init_cpio.c which is entirely self-contained, and the kernel's | ||
157 | boot-time extractor is also (obviously) self-contained. However, if you _do_ | ||
158 | happen to have cpio installed, the following command line can extract the | ||
159 | generated cpio image back into its component files: | ||
160 | |||
161 | cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames | ||
162 | |||
163 | Contents of initramfs: | ||
164 | ---------------------- | ||
165 | |||
152 | If you don't already understand what shared libraries, devices, and paths | 166 | If you don't already understand what shared libraries, devices, and paths |
153 | you need to get a minimal root filesystem up and running, here are some | 167 | you need to get a minimal root filesystem up and running, here are some |
154 | references: | 168 | references: |
@@ -161,13 +175,69 @@ designed to be a tiny C library to statically link early userspace | |||
161 | code against, along with some related utilities. It is BSD licensed. | 175 | code against, along with some related utilities. It is BSD licensed. |
162 | 176 | ||
163 | I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) | 177 | I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) |
164 | myself. These are LGPL and GPL, respectively. | 178 | myself. These are LGPL and GPL, respectively. (A self-contained initramfs |
179 | package is planned for the busybox 1.2 release.) | ||
165 | 180 | ||
166 | In theory you could use glibc, but that's not well suited for small embedded | 181 | In theory you could use glibc, but that's not well suited for small embedded |
167 | uses like this. (A "hello world" program statically linked against glibc is | 182 | uses like this. (A "hello world" program statically linked against glibc is |
168 | over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do | 183 | over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do |
169 | name lookups, even when otherwise statically linked.) | 184 | name lookups, even when otherwise statically linked.) |
170 | 185 | ||
186 | Why cpio rather than tar? | ||
187 | ------------------------- | ||
188 | |||
189 | This decision was made back in December, 2001. The discussion started here: | ||
190 | |||
191 | http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html | ||
192 | |||
193 | And spawned a second thread (specifically on tar vs cpio), starting here: | ||
194 | |||
195 | http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html | ||
196 | |||
197 | The quick and dirty summary version (which is no substitute for reading | ||
198 | the above threads) is: | ||
199 | |||
200 | 1) cpio is a standard. It's decades old (from the AT&T days), and already | ||
201 | widely used on Linux (inside RPM, Red Hat's device driver disks). Here's | ||
202 | a Linux Journal article about it from 1996: | ||
203 | |||
204 | http://www.linuxjournal.com/article/1213 | ||
205 | |||
206 | It's not as popular as tar because the traditional cpio command line tools | ||
207 | require _truly_hideous_ command line arguments. But that says nothing | ||
208 | either way about the archive format, and there are alternative tools, | ||
209 | such as: | ||
210 | |||
211 | http://freshmeat.net/projects/afio/ | ||
212 | |||
213 | 2) The cpio archive format chosen by the kernel is simpler and cleaner (and | ||
214 | thus easier to create and parse) than any of the (literally dozens of) | ||
215 | various tar archive formats. The complete initramfs archive format is | ||
216 | explained in buffer-format.txt, created in usr/gen_init_cpio.c, and | ||
217 | extracted in init/initramfs.c. All three together come to less than 26k | ||
218 | total of human-readable text. | ||
219 | |||
220 | 3) The GNU project standardizing on tar is approximately as relevant as | ||
221 | Windows standardizing on zip. Linux is not part of either, and is free | ||
222 | to make its own technical decisions. | ||
223 | |||
224 | 4) Since this is a kernel internal format, it could easily have been | ||
225 | something brand new. The kernel provides its own tools to create and | ||
226 | extract this format anyway. Using an existing standard was preferable, | ||
227 | but not essential. | ||
228 | |||
229 | 5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be | ||
230 | supported on the kernel side"): | ||
231 | |||
232 | http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html | ||
233 | |||
234 | explained his reasoning: | ||
235 | |||
236 | http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html | ||
237 | http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html | ||
238 | |||
239 | and, most importantly, designed and implemented the initramfs code. | ||
240 | |||
171 | Future directions: | 241 | Future directions: |
172 | ------------------ | 242 | ------------------ |
173 | 243 | ||
diff --git a/Documentation/filesystems/relayfs.txt b/Documentation/filesystems/relayfs.txt index d803abed29f0..5832377b7340 100644 --- a/Documentation/filesystems/relayfs.txt +++ b/Documentation/filesystems/relayfs.txt | |||
@@ -44,30 +44,41 @@ relayfs can operate in a mode where it will overwrite data not yet | |||
44 | collected by userspace, and not wait for it to consume it. | 44 | collected by userspace, and not wait for it to consume it. |
45 | 45 | ||
46 | relayfs itself does not provide for communication of such data between | 46 | relayfs itself does not provide for communication of such data between |
47 | userspace and kernel, allowing the kernel side to remain simple and not | 47 | userspace and kernel, allowing the kernel side to remain simple and |
48 | impose a single interface on userspace. It does provide a separate | 48 | not impose a single interface on userspace. It does provide a set of |
49 | helper though, described below. | 49 | examples and a separate helper though, described below. |
50 | |||
51 | klog and relay-apps example code | ||
52 | ================================ | ||
53 | |||
54 | relayfs itself is ready to use, but to make things easier, a couple | ||
55 | simple utility functions and a set of examples are provided. | ||
56 | |||
57 | The relay-apps example tarball, available on the relayfs sourceforge | ||
58 | site, contains a set of self-contained examples, each consisting of a | ||
59 | pair of .c files containing boilerplate code for each of the user and | ||
60 | kernel sides of a relayfs application; combined these two sets of | ||
61 | boilerplate code provide glue to easily stream data to disk, without | ||
62 | having to bother with mundane housekeeping chores. | ||
63 | |||
64 | The 'klog debugging functions' patch (klog.patch in the relay-apps | ||
65 | tarball) provides a couple of high-level logging functions to the | ||
66 | kernel which allow writing formatted text or raw data to a channel, | ||
67 | regardless of whether a channel to write into exists or not, or | ||
68 | whether relayfs is compiled into the kernel or is configured as a | ||
69 | module. These functions allow you to put unconditional 'trace' | ||
70 | statements anywhere in the kernel or kernel modules; only when there | ||
71 | is a 'klog handler' registered will data actually be logged (see the | ||
72 | klog and kleak examples for details). | ||
73 | |||
74 | It is of course possible to use relayfs from scratch i.e. without | ||
75 | using any of the relay-apps example code or klog, but you'll have to | ||
76 | implement communication between userspace and kernel, allowing both to | ||
77 | convey the state of buffers (full, empty, amount of padding). | ||
78 | |||
79 | klog and the relay-apps examples can be found in the relay-apps | ||
80 | tarball on http://relayfs.sourceforge.net | ||
50 | 81 | ||
51 | klog, relay-app & librelay | ||
52 | ========================== | ||
53 | |||
54 | relayfs itself is ready to use, but to make things easier, two | ||
55 | additional systems are provided. klog is a simple wrapper to make | ||
56 | writing formatted text or raw data to a channel simpler, regardless of | ||
57 | whether a channel to write into exists or not, or whether relayfs is | ||
58 | compiled into the kernel or is configured as a module. relay-app is | ||
59 | the kernel counterpart of userspace librelay.c, combined these two | ||
60 | files provide glue to easily stream data to disk, without having to | ||
61 | bother with housekeeping. klog and relay-app can be used together, | ||
62 | with klog providing high-level logging functions to the kernel and | ||
63 | relay-app taking care of kernel-user control and disk-logging chores. | ||
64 | |||
65 | It is possible to use relayfs without relay-app & librelay, but you'll | ||
66 | have to implement communication between userspace and kernel, allowing | ||
67 | both to convey the state of buffers (full, empty, amount of padding). | ||
68 | |||
69 | klog, relay-app and librelay can be found in the relay-apps tarball on | ||
70 | http://relayfs.sourceforge.net | ||
71 | 82 | ||
72 | The relayfs user space API | 83 | The relayfs user space API |
73 | ========================== | 84 | ========================== |
@@ -125,6 +136,8 @@ Here's a summary of the API relayfs provides to in-kernel clients: | |||
125 | relay_reset(chan) | 136 | relay_reset(chan) |
126 | relayfs_create_dir(name, parent) | 137 | relayfs_create_dir(name, parent) |
127 | relayfs_remove_dir(dentry) | 138 | relayfs_remove_dir(dentry) |
139 | relayfs_create_file(name, parent, mode, fops, data) | ||
140 | relayfs_remove_file(dentry) | ||
128 | 141 | ||
129 | channel management typically called on instigation of userspace: | 142 | channel management typically called on instigation of userspace: |
130 | 143 | ||
@@ -141,6 +154,8 @@ Here's a summary of the API relayfs provides to in-kernel clients: | |||
141 | subbuf_start(buf, subbuf, prev_subbuf, prev_padding) | 154 | subbuf_start(buf, subbuf, prev_subbuf, prev_padding) |
142 | buf_mapped(buf, filp) | 155 | buf_mapped(buf, filp) |
143 | buf_unmapped(buf, filp) | 156 | buf_unmapped(buf, filp) |
157 | create_buf_file(filename, parent, mode, buf, is_global) | ||
158 | remove_buf_file(dentry) | ||
144 | 159 | ||
145 | helper functions: | 160 | helper functions: |
146 | 161 | ||
@@ -320,6 +335,71 @@ forces a sub-buffer switch on all the channel buffers, and can be used | |||
320 | to finalize and process the last sub-buffers before the channel is | 335 | to finalize and process the last sub-buffers before the channel is |
321 | closed. | 336 | closed. |
322 | 337 | ||
338 | Creating non-relay files | ||
339 | ------------------------ | ||
340 | |||
341 | relay_open() automatically creates files in the relayfs filesystem to | ||
342 | represent the per-cpu kernel buffers; it's often useful for | ||
343 | applications to be able to create their own files alongside the relay | ||
344 | files in the relayfs filesystem as well e.g. 'control' files much like | ||
345 | those created in /proc or debugfs for similar purposes, used to | ||
346 | communicate control information between the kernel and user sides of a | ||
347 | relayfs application. For this purpose the relayfs_create_file() and | ||
348 | relayfs_remove_file() API functions exist. For relayfs_create_file(), | ||
349 | the caller passes in a set of user-defined file operations to be used | ||
350 | for the file and an optional void * to a user-specified data item, | ||
351 | which will be accessible via inode->u.generic_ip (see the relay-apps | ||
352 | tarball for examples). The file_operations are a required parameter | ||
353 | to relayfs_create_file() and thus the semantics of these files are | ||
354 | completely defined by the caller. | ||
355 | |||
356 | See the relay-apps tarball at http://relayfs.sourceforge.net for | ||
357 | examples of how these non-relay files are meant to be used. | ||
358 | |||
359 | Creating relay files in other filesystems | ||
360 | ----------------------------------------- | ||
361 | |||
362 | By default of course, relay_open() creates relay files in the relayfs | ||
363 | filesystem. Because relay_file_operations is exported, however, it's | ||
364 | also possible to create and use relay files in other pseudo-filesystems | ||
365 | such as debugfs. | ||
366 | |||
367 | For this purpose, two callback functions are provided, | ||
368 | create_buf_file() and remove_buf_file(). create_buf_file() is called | ||
369 | once for each per-cpu buffer from relay_open() to allow the client to | ||
370 | create a file to be used to represent the corresponding buffer; if | ||
371 | this callback is not defined, the default implementation will create | ||
372 | and return a file in the relayfs filesystem to represent the buffer. | ||
373 | The callback should return the dentry of the file created to represent | ||
374 | the relay buffer. Note that the parent directory passed to | ||
375 | relay_open() (and passed along to the callback), if specified, must | ||
376 | exist in the same filesystem the new relay file is created in. If | ||
377 | create_buf_file() is defined, remove_buf_file() must also be defined; | ||
378 | it's responsible for deleting the file(s) created in create_buf_file() | ||
379 | and is called during relay_close(). | ||
380 | |||
381 | The create_buf_file() implementation can also be defined in such a way | ||
382 | as to allow the creation of a single 'global' buffer instead of the | ||
383 | default per-cpu set. This can be useful for applications interested | ||
384 | mainly in seeing the relative ordering of system-wide events without | ||
385 | the need to bother with saving explicit timestamps for the purpose of | ||
386 | merging/sorting per-cpu files in a postprocessing step. | ||
387 | |||
388 | To have relay_open() create a global buffer, the create_buf_file() | ||
389 | implementation should set the value of the is_global outparam to a | ||
390 | non-zero value in addition to creating the file that will be used to | ||
391 | represent the single buffer. In the case of a global buffer, | ||
392 | create_buf_file() and remove_buf_file() will be called only once. The | ||
393 | normal channel-writing functions e.g. relay_write() can still be used | ||
394 | - writes from any cpu will transparently end up in the global buffer - | ||
395 | but since it is a global buffer, callers should make sure they use the | ||
396 | proper locking for such a buffer, either by wrapping writes in a | ||
397 | spinlock, or by copying a write function from relayfs_fs.h and | ||
398 | creating a local version that internally does the proper locking. | ||
399 | |||
400 | See the 'exported-relayfile' examples in the relay-apps tarball for | ||
401 | examples of creating and using relay files in debugfs. | ||
402 | |||
323 | Misc | 403 | Misc |
324 | ---- | 404 | ---- |
325 | 405 | ||
diff --git a/Documentation/filesystems/spufs.txt b/Documentation/filesystems/spufs.txt new file mode 100644 index 000000000000..8edc3952eff4 --- /dev/null +++ b/Documentation/filesystems/spufs.txt | |||
@@ -0,0 +1,521 @@ | |||
1 | SPUFS(2) Linux Programmer's Manual SPUFS(2) | ||
2 | |||
3 | |||
4 | |||
5 | NAME | ||
6 | spufs - the SPU file system | ||
7 | |||
8 | |||
9 | DESCRIPTION | ||
10 | The SPU file system is used on PowerPC machines that implement the Cell | ||
11 | Broadband Engine Architecture in order to access Synergistic Processor | ||
12 | Units (SPUs). | ||
13 | |||
14 | The file system provides a name space similar to posix shared memory or | ||
15 | message queues. Users that have write permissions on the file system | ||
16 | can use spu_create(2) to establish SPU contexts in the spufs root. | ||
17 | |||
18 | Every SPU context is represented by a directory containing a predefined | ||
19 | set of files. These files can be used for manipulating the state of the | ||
20 | logical SPU. Users can change permissions on those files, but not actu- | ||
21 | ally add or remove files. | ||
22 | |||
23 | |||
24 | MOUNT OPTIONS | ||
25 | uid=<uid> | ||
26 | set the user owning the mount point, the default is 0 (root). | ||
27 | |||
28 | gid=<gid> | ||
29 | set the group owning the mount point, the default is 0 (root). | ||
30 | |||
31 | |||
32 | FILES | ||
33 | The files in spufs mostly follow the standard behavior for regular sys- | ||
34 | tem calls like read(2) or write(2), but often support only a subset of | ||
35 | the operations supported on regular file systems. This list details the | ||
36 | supported operations and the deviations from the behaviour in the | ||
37 | respective man pages. | ||
38 | |||
39 | All files that support the read(2) operation also support readv(2) and | ||
40 | all files that support the write(2) operation also support writev(2). | ||
41 | All files support the access(2) and stat(2) family of operations, but | ||
42 | only the st_mode, st_nlink, st_uid and st_gid fields of struct stat | ||
43 | contain reliable information. | ||
44 | |||
45 | All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2) opera- | ||
46 | tions, but will not be able to grant permissions that contradict the | ||
47 | possible operations, e.g. read access on the wbox file. | ||
48 | |||
49 | The current set of files is: | ||
50 | |||
51 | |||
52 | /mem | ||
53 | the contents of the local storage memory of the SPU. This can be | ||
54 | accessed like a regular shared memory file and contains both code and | ||
55 | data in the address space of the SPU. The possible operations on an | ||
56 | open mem file are: | ||
57 | |||
58 | read(2), pread(2), write(2), pwrite(2), lseek(2) | ||
59 | These operate as documented, with the exception that lseek(2), | ||
60 | write(2) and pwrite(2) are not supported beyond the end of the | ||
61 | file. The file size is the size of the local storage of the SPU, | ||
62 | which normally is 256 kilobytes. | ||
63 | |||
64 | mmap(2) | ||
65 | Mapping mem into the process address space gives access to the | ||
66 | SPU local storage within the process address space. Only | ||
67 | MAP_SHARED mappings are allowed. | ||
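As a rough sketch of the mmap(2) case, the helper below maps a mem file with MAP_SHARED. The name map_local_store() is hypothetical, and the caller is assumed to know both the path of the context's mem file and the local storage size (normally 256 kilobytes, per the text above).

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map len bytes of the mem file at mem_path into
 * the calling process.  Returns the mapping, or NULL on failure.
 */
void *map_local_store(const char *mem_path, size_t len)
{
	void *ls;
	int fd;

	fd = open(mem_path, O_RDWR);
	if (fd < 0)
		return NULL;

	/* Only MAP_SHARED mappings are allowed on the mem file. */
	ls = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* The mapping remains valid after the descriptor is closed. */
	close(fd);

	return ls == MAP_FAILED ? NULL : ls;
}
```

The caller releases the mapping with munmap(2) using the same length.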
68 | |||
69 | |||
70 | /mbox | ||
71 | The first SPU to CPU communication mailbox. This file is read-only and | ||
72 | can be read in units of 32 bits. The file can only be used in non- | ||
73 | blocking mode and even poll() will not block on it. The possible | ||
74 | operations on an open mbox file are: | ||
75 | |||
76 | read(2) | ||
77 | If a count smaller than four is requested, read returns -1 and | ||
78 | sets errno to EINVAL. If there is no data available in the mail | ||
79 | box, the return value is set to -1 and errno becomes EAGAIN. | ||
80 | When data has been read successfully, four bytes are placed in | ||
81 | the data buffer and the value four is returned. | ||
82 | |||
83 | |||
84 | /ibox | ||
85 | The second SPU to CPU communication mailbox. This file is similar to | ||
86 | the first mailbox file, but can be read in blocking I/O mode, and the | ||
87 | poll family of system calls can be used to wait for it. The possible | ||
88 | operations on an open ibox file are: | ||
89 | |||
90 | read(2) | ||
91 | If a count smaller than four is requested, read returns -1 and | ||
92 | sets errno to EINVAL. If there is no data available in the mail | ||
93 | box and the file descriptor has been opened with O_NONBLOCK, the | ||
94 | return value is set to -1 and errno becomes EAGAIN. | ||
95 | |||
96 | If there is no data available in the mail box and the file | ||
97 | descriptor has been opened without O_NONBLOCK, the call will | ||
98 | block until the SPU writes to its interrupt mailbox channel. | ||
99 | When data has been read successfully, four bytes are placed in | ||
100 | the data buffer and the value four is returned. | ||
101 | |||
102 | poll(2) | ||
103 | Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever | ||
104 | data is available for reading. | ||
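Combining poll(2) and read(2) on ibox gives a bounded wait for the next word, as in the sketch below. The helper name ibox_wait_word(), the millisecond timeout parameter and the return convention are assumptions for illustration, not part of spufs.

```c
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms milliseconds for a word
 * on the ibox descriptor fd.  Returns 1 if a word was stored in *word,
 * 0 on timeout, or a negative errno value on failure.
 */
int ibox_wait_word(int fd, int timeout_ms, uint32_t *word)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	ssize_t n;
	int ret;

	ret = poll(&pfd, 1, timeout_ms);
	if (ret < 0)
		return -errno;
	if (ret == 0)
		return 0;		/* nothing arrived in time */
	if (!(pfd.revents & POLLIN))
		return -EIO;		/* error or hangup, no data */

	n = read(fd, word, sizeof(*word));
	if (n == (ssize_t)sizeof(*word))
		return 1;
	return n < 0 ? -errno : -EIO;
}
```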
105 | |||
106 | |||
107 | /wbox | ||
108 | The CPU to SPU communication mailbox. It is write-only and can be written | ||
109 | in units of 32 bits. If the mailbox is full, write() will block and | ||
110 | poll can be used to wait for it becoming empty again. The possible | ||
111 | operations on an open wbox file are: write(2) If a count smaller than | ||
112 | four is requested, write returns -1 and sets errno to EINVAL. If there | ||
113 | is no space available in the mail box and the file descriptor has been | ||
114 | opened with O_NONBLOCK, the return value is set to -1 and errno becomes | ||
115 | EAGAIN. | ||
116 | |||
117 | If there is no space available in the mail box and the file descriptor | ||
118 | has been opened without O_NONBLOCK, the call will block until the SPU | ||
119 | reads from its PPE mailbox channel. When data has been written | ||
120 | successfully, four bytes have been taken from the data buffer and the | ||
121 | value four is returned. | ||
122 | |||
123 | poll(2) | ||
124 | Poll on the wbox file returns (POLLOUT | POLLWRNORM) whenever | ||
125 | space is available for writing. | ||
126 | |||
127 | |||
128 | /mbox_stat | ||
129 | /ibox_stat | ||
130 | /wbox_stat | ||
131 | Read-only files that contain the length of the current queue, i.e. how | ||
132 | many words can be read from mbox or ibox or how many words can be | ||
133 | written to wbox without blocking. The files can be read only in 4-byte | ||
134 | units and return a big-endian binary integer number. The possible | ||
135 | operations on an open *box_stat file are: | ||
136 | |||
137 | read(2) | ||
138 | If a count smaller than four is requested, read returns -1 and | ||
139 | sets errno to EINVAL. Otherwise, a four byte value is placed in | ||
140 | the data buffer, containing the number of elements that can be | ||
141 | read from (for mbox_stat and ibox_stat) or written to (for | ||
142 | wbox_stat) the respective mail box without blocking or resulting | ||
143 | in EAGAIN. | ||
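Reading one of the *box_stat files could look roughly like this. The helper name box_stat_read is hypothetical; the only facts taken from the text above are the four-byte read size and the big-endian encoding.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the queue length from an open mbox_stat,
 * ibox_stat or wbox_stat descriptor. The file returns a big-endian
 * 32-bit integer, so it is decoded byte by byte to stay independent
 * of the host byte order.
 * Returns 0 on success, -1 on error (errno is set).
 */
static int box_stat_read(int fd, uint32_t *count)
{
    unsigned char raw[4];
    ssize_t n = read(fd, raw, sizeof(raw));

    if (n < 0)
        return -1;
    if (n != (ssize_t)sizeof(raw)) {
        errno = EIO;                /* fewer than four bytes returned */
        return -1;
    }
    *count = ((uint32_t)raw[0] << 24) | ((uint32_t)raw[1] << 16) |
             ((uint32_t)raw[2] << 8)  |  (uint32_t)raw[3];
    return 0;
}
```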
144 | |||
145 | |||
146 | /npc | ||
147 | /decr | ||
148 | /decr_status | ||
149 | /spu_tag_mask | ||
150 | /event_mask | ||
151 | /srr0 | ||
152 | Internal registers of the SPU. Each file holds an ASCII string with the | ||
153 | numeric value of the register (for npc, the address of the next instruction | ||
154 | to be executed). These can be used in read/write mode for debugging, but | ||
155 | normal operation of programs should not rely on them because access to any | ||
156 | of them except npc requires an SPU context save and is therefore very inefficient. | ||
157 | |||
158 | The contents of these files are: | ||
159 | |||
160 | npc Next Program Counter | ||
161 | |||
162 | decr SPU Decrementer | ||
163 | |||
164 | decr_status Decrementer Status | ||
165 | |||
166 | spu_tag_mask MFC tag mask for SPU DMA | ||
167 | |||
168 | event_mask Event mask for SPU interrupts | ||
169 | |||
170 | srr0 Interrupt Return address register | ||
171 | |||
172 | |||
173 | The possible operations on an open npc, decr, decr_status, | ||
174 | spu_tag_mask, event_mask or srr0 file are: | ||
175 | |||
176 | read(2) | ||
177 | When the count supplied to the read call is shorter than the | ||
178 | required length for the pointer value plus a newline character, | ||
179 | subsequent reads from the same file descriptor will result in | ||
180 | completing the string, regardless of changes to the register by | ||
181 | a running SPU task. When a complete string has been read, all | ||
182 | subsequent read operations will return zero bytes and a new file | ||
183 | descriptor needs to be opened to read the value again. | ||
184 | |||
185 | write(2) | ||
186 | A write operation on the file results in setting the register to | ||
187 | the value given in the string. The string is parsed from the | ||
188 | beginning to the first non-numeric character or the end of the | ||
189 | buffer. Subsequent writes to the same file descriptor overwrite | ||
190 | the previous setting. | ||
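Because a register file may hand out its string in pieces and then return zero bytes, a reader has to loop until end of file before parsing. The sketch below shows one way to do that. Both helper names are hypothetical, and the radix is left as a parameter because the text above does not say whether the value is printed in decimal or hexadecimal.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the complete ASCII string from an open
 * register file (npc, decr, ...) into buf, looping over short reads
 * until read() returns zero. Returns the string length or -1.
 */
static ssize_t read_register_text(int fd, char *buf, size_t size)
{
    size_t used = 0;

    if (size == 0) {
        errno = EINVAL;
        return -1;
    }
    while (used < size - 1) {
        ssize_t n = read(fd, buf + used, size - 1 - used);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;                  /* complete string has been read */
        used += (size_t)n;
    }
    buf[used] = '\0';
    return (ssize_t)used;
}

/*
 * Hypothetical helper: parse the string up to the first non-numeric
 * character. The base is an assumption supplied by the caller.
 * Returns 0 on success, -1 if no digits were found or on overflow.
 */
static int parse_register(const char *text, int base,
                          unsigned long long *value)
{
    char *end;
    unsigned long long v;

    errno = 0;
    v = strtoull(text, &end, base);
    if (errno != 0 || end == text)
        return -1;
    *value = v;
    return 0;
}
```

To read the register a second time, open the file again rather than reusing the descriptor.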
191 | |||
192 | |||
193 | /fpcr | ||
194 | This file gives access to the Floating Point Status and Control Regis- | ||
195 | ter as a four byte long file. The operations on the fpcr file are: | ||
196 | |||
197 | read(2) | ||
198 | If a count smaller than four is requested, read returns -1 and | ||
199 | sets errno to EINVAL. Otherwise, a four byte value is placed in | ||
200 | the data buffer, containing the current value of the fpcr regis- | ||
201 | ter. | ||
202 | |||
203 | write(2) | ||
204 | If a count smaller than four is requested, write returns -1 and | ||
205 | sets errno to EINVAL. Otherwise, a four byte value is copied | ||
206 | from the data buffer, updating the value of the fpcr register. | ||
207 | |||
208 | |||
209 | /signal1 | ||
210 | /signal2 | ||
211 | The two signal notification channels of an SPU. These are read-write | ||
212 | files that operate on a 32 bit word. Writing to one of these files | ||
213 | triggers an interrupt on the SPU. The value written to the signal | ||
214 | files can be read from the SPU through a channel read or from host user | ||
215 | space through the file. After the value has been read by the SPU, it | ||
216 | is reset to zero. The possible operations on an open signal1 or sig- | ||
217 | nal2 file are: | ||
218 | |||
219 | read(2) | ||
220 | If a count smaller than four is requested, read returns -1 and | ||
221 | sets errno to EINVAL. Otherwise, a four byte value is placed in | ||
222 | the data buffer, containing the current value of the specified | ||
223 | signal notification register. | ||
224 | |||
225 | write(2) | ||
226 | If a count smaller than four is requested, write returns -1 and | ||
227 | sets errno to EINVAL. Otherwise, a four byte value is copied | ||
228 | from the data buffer, updating the value of the specified signal | ||
229 | notification register. The signal notification register will | ||
230 | either be replaced with the input data or will be updated to the | ||
231 | bitwise OR of the old value and the input data, depending on the | ||
232 | contents of the signal1_type, or signal2_type respectively, | ||
233 | file. | ||
234 | |||
235 | |||
236 | /signal1_type | ||
237 | /signal2_type | ||
238 | These two files change the behavior of the signal1 and signal2 | ||
239 | notification files. They contain a numerical ASCII string which is read as | ||
240 | either "1" or "0". In mode 0 (overwrite), the hardware replaces the | ||
241 | contents of the signal channel with the data that is written to it. In | ||
242 | mode 1 (logical OR), the hardware accumulates the bits that are | ||
243 | subsequently written to it. The possible operations on an open signal1_type | ||
244 | or signal2_type file are: | ||
245 | |||
246 | read(2) | ||
247 | When the count supplied to the read call is shorter than the | ||
248 | required length for the digit plus a newline character, subse- | ||
249 | quent reads from the same file descriptor will result in com- | ||
250 | pleting the string. When a complete string has been read, all | ||
251 | subsequent read operations will return zero bytes and a new file | ||
252 | descriptor needs to be opened to read the value again. | ||
253 | |||
254 | write(2) | ||
255 | A write operation on the file results in setting the register to | ||
256 | the value given in the string. The string is parsed from the | ||
257 | beginning to the first non-numeric character or the end of the | ||
258 | buffer. Subsequent writes to the same file descriptor overwrite | ||
259 | the previous setting. | ||
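The effect of the two modes on a signal notification register can be modelled in a few lines. This is only a software model of the behavior described above, with hypothetical names; the real update happens in hardware.

```c
#include <stdint.h>

/* Mode values as stored in signal1_type / signal2_type. */
enum signal_mode {
    SIGNAL_OVERWRITE = 0,   /* "0": new data replaces the old contents */
    SIGNAL_OR        = 1    /* "1": new data is ORed into the contents */
};

/*
 * Hypothetical model: return the register contents after writing
 * `input` to a signal file whose register currently holds `current`.
 */
static uint32_t signal_apply(uint32_t current, uint32_t input,
                             enum signal_mode mode)
{
    return mode == SIGNAL_OR ? (current | input) : input;
}
```

In OR mode, several writers can each set their own bit without losing bits set by others before the SPU reads the register.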
260 | |||
261 | |||
262 | EXAMPLES | ||
263 | /etc/fstab entry | ||
264 | none /spu spufs gid=spu 0 0 | ||
265 | |||
266 | |||
267 | AUTHORS | ||
268 | Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>, | ||
269 | Ulrich Weigand <Ulrich.Weigand@de.ibm.com> | ||
270 | |||
271 | SEE ALSO | ||
272 | capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7) | ||
273 | |||
274 | |||
275 | |||
276 | Linux 2005-09-28 SPUFS(2) | ||
277 | |||
278 | ------------------------------------------------------------------------------ | ||
279 | |||
280 | SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2) | ||
281 | |||
282 | |||
283 | |||
284 | NAME | ||
285 | spu_run - execute an spu context | ||
286 | |||
287 | |||
288 | SYNOPSIS | ||
289 | #include <sys/spu.h> | ||
290 | |||
291 | int spu_run(int fd, unsigned int *npc, unsigned int *event); | ||
292 | |||
293 | DESCRIPTION | ||
294 | The spu_run system call is used on PowerPC machines that implement the | ||
295 | Cell Broadband Engine Architecture in order to access Synergistic Pro- | ||
296 | cessor Units (SPUs). It uses the fd that was returned from spu_cre- | ||
297 | ate(2) to address a specific SPU context. When the context gets sched- | ||
298 | uled to a physical SPU, it starts execution at the instruction pointer | ||
299 | passed in npc. | ||
300 | |||
301 | Execution of SPU code happens synchronously, meaning that spu_run does | ||
302 | not return while the SPU is still running. If there is a need to exe- | ||
303 | cute SPU code in parallel with other code on either the main CPU or | ||
304 | other SPUs, you need to create a new thread of execution first, e.g. | ||
305 | using the pthread_create(3) call. | ||
306 | |||
307 | When spu_run returns, the current value of the SPU instruction pointer | ||
308 | is written back to npc, so you can call spu_run again without updating | ||
309 | the pointers. | ||
310 | |||
311 | event can be a NULL pointer or point to an extended status code that | ||
312 | gets filled when spu_run returns. It can be one of the following con- | ||
313 | stants: | ||
314 | |||
315 | SPE_EVENT_DMA_ALIGNMENT | ||
316 | A DMA alignment error | ||
317 | |||
318 | SPE_EVENT_SPE_DATA_SEGMENT | ||
319 | A DMA segmentation error | ||
320 | |||
321 | SPE_EVENT_SPE_DATA_STORAGE | ||
322 | A DMA storage error | ||
323 | |||
324 | If NULL is passed as the event argument, these errors will result in a | ||
325 | signal delivered to the calling process. | ||
326 | |||
327 | RETURN VALUE | ||
328 | spu_run returns the value of the spu_status register, or -1 to indicate | ||
329 | an error, setting errno to one of the error codes listed below. The | ||
330 | spu_status register value contains a bit mask of status codes and | ||
331 | optionally a 14 bit code returned from the stop-and-signal instruction | ||
332 | on the SPU. The bit masks for the status codes are: | ||
333 | |||
334 | 0x02 SPU was stopped by stop-and-signal. | ||
335 | |||
336 | 0x04 SPU was stopped by halt. | ||
337 | |||
338 | 0x08 SPU is waiting for a channel. | ||
339 | |||
340 | 0x10 SPU is in single-step mode. | ||
341 | |||
342 | 0x20 SPU has tried to execute an invalid instruction. | ||
343 | |||
344 | 0x40 SPU has tried to access an invalid channel. | ||
345 | |||
346 | 0x3fff0000 | ||
347 | The bits masked with this value contain the code returned from | ||
348 | stop-and-signal. | ||
349 | |||
350 | There are always one or more of the lower eight bits set or an error | ||
351 | code is returned from spu_run. | ||
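The bit layout above can be decoded as in the following sketch. The macro and function names are made up for illustration and are not taken from sys/spu.h; only the mask values come from the list above. A return value of -1 must be checked before any decoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the status bits listed above. */
#define SPU_STATUS_STOP_AND_SIGNAL  0x02u
#define SPU_STATUS_HALT             0x04u
#define SPU_STATUS_WAITING_CHANNEL  0x08u
#define SPU_STATUS_SINGLE_STEP      0x10u
#define SPU_STATUS_INVALID_INSTR    0x20u
#define SPU_STATUS_INVALID_CHANNEL  0x40u
#define SPU_STOP_CODE_MASK          0x3fff0000u
#define SPU_STOP_CODE_SHIFT         16

/* True if the SPU was stopped by a stop-and-signal instruction. */
static bool spu_stopped_by_signal(uint32_t status)
{
    return (status & SPU_STATUS_STOP_AND_SIGNAL) != 0;
}

/* Extract the 14-bit code passed to stop-and-signal. */
static uint32_t spu_stop_code(uint32_t status)
{
    return (status & SPU_STOP_CODE_MASK) >> SPU_STOP_CODE_SHIFT;
}
```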
352 | |||
353 | ERRORS | ||
354 | EAGAIN or EWOULDBLOCK | ||
355 | fd is in non-blocking mode and spu_run would block. | ||
356 | |||
357 | EBADF fd is not a valid file descriptor. | ||
358 | |||
359 | EFAULT npc is not a valid pointer or event is neither NULL nor a valid | ||
360 | pointer. | ||
361 | |||
362 | EINTR A signal occurred while spu_run was in progress. The npc value | ||
363 | has been updated to the new program counter value if necessary. | ||
364 | |||
365 | EINVAL fd is not a file descriptor returned from spu_create(2). | ||
366 | |||
367 | ENOMEM Insufficient memory was available to handle a page fault result- | ||
368 | ing from an MFC direct memory access. | ||
369 | |||
370 | ENOSYS The functionality is not provided by the current system, because | ||
371 | either the hardware does not provide SPUs or the spufs module is | ||
372 | not loaded. | ||
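Since npc is updated when spu_run fails with EINTR, a caller can simply call it again with the same pointers. The sketch below illustrates that retry pattern under some assumptions: spu_run is passed in as a function pointer with the signature from the SYNOPSIS, so the loop can be exercised on machines without SPUs, and fake_spu_run is a made-up stand-in, not part of any real interface. Whether blindly retrying is appropriate depends on how the caller wants to handle signals.

```c
#include <errno.h>
#include <stddef.h>

/* Same signature as spu_run in the SYNOPSIS above. */
typedef int (*spu_run_fn)(int fd, unsigned int *npc, unsigned int *event);

/*
 * Hypothetical wrapper: call `run` until it returns something other
 * than an EINTR failure. npc carries the resume address across calls.
 */
static int spu_run_retry(spu_run_fn run, int fd,
                         unsigned int *npc, unsigned int *event)
{
    int status;

    do {
        status = run(fd, npc, event);
    } while (status == -1 && errno == EINTR);
    return status;
}

/*
 * Stand-in for the real system call, for demonstration only: it is
 * "interrupted" twice, then reports a stop-and-signal status (0x02).
 */
static int fake_spu_run(int fd, unsigned int *npc, unsigned int *event)
{
    static int calls;

    (void)fd;
    (void)event;
    if (calls++ < 2) {
        errno = EINTR;
        return -1;
    }
    *npc += 4;
    return 0x02;
}
```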
373 | |||
374 | |||
375 | NOTES | ||
376 | spu_run is meant to be used from libraries that implement a more | ||
377 | abstract interface to SPUs, not to be used from regular applications. | ||
378 | See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- | ||
379 | ommended libraries. | ||
380 | |||
381 | |||
382 | CONFORMING TO | ||
383 | This call is Linux specific and only implemented by the ppc64 architec- | ||
384 | ture. Programs using this system call are not portable. | ||
385 | |||
386 | |||
387 | BUGS | ||
388 | The code does not yet fully implement all features outlined here. | ||
389 | |||
390 | |||
391 | AUTHOR | ||
392 | Arnd Bergmann <arndb@de.ibm.com> | ||
393 | |||
394 | SEE ALSO | ||
395 | capabilities(7), close(2), spu_create(2), spufs(7) | ||
396 | |||
397 | |||
398 | |||
399 | Linux 2005-09-28 SPU_RUN(2) | ||
400 | |||
401 | ------------------------------------------------------------------------------ | ||
402 | |||
403 | SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2) | ||
404 | |||
405 | |||
406 | |||
407 | NAME | ||
408 | spu_create - create a new spu context | ||
409 | |||
410 | |||
411 | SYNOPSIS | ||
412 | #include <sys/types.h> | ||
413 | #include <sys/spu.h> | ||
414 | |||
415 | int spu_create(const char *pathname, int flags, mode_t mode); | ||
416 | |||
417 | DESCRIPTION | ||
418 | The spu_create system call is used on PowerPC machines that implement | ||
419 | the Cell Broadband Engine Architecture in order to access Synergistic | ||
420 | Processor Units (SPUs). It creates a new logical context for an SPU in | ||
421 | pathname and returns a handle associated with it. pathname must | ||
422 | point to a non-existing directory in the mount point of the SPU file | ||
423 | system (spufs). When spu_create is successful, a directory gets cre- | ||
424 | ated on pathname and it is populated with files. | ||
425 | |||
426 | The returned file handle can only be passed to spu_run(2) or closed; | ||
427 | other operations are not defined on it. When it is closed, all associ- | ||
428 | ated directory entries in spufs are removed. When the last file handle | ||
429 | pointing either inside of the context directory or to this file | ||
430 | descriptor is closed, the logical SPU context is destroyed. | ||
431 | |||
432 | The parameter flags can be zero or any bitwise or'd combination of the | ||
433 | following constants: | ||
434 | |||
435 | SPU_RAWIO | ||
436 | Allow mapping of some of the hardware registers of the SPU into | ||
437 | user space. This flag requires the CAP_SYS_RAWIO capability, see | ||
438 | capabilities(7). | ||
439 | |||
440 | The mode parameter specifies the permissions used for creating the new | ||
441 | directory in spufs. mode is modified with the user's umask(2) value | ||
442 | and then used for both the directory and the files contained in it. The | ||
443 | file permissions mask out some more bits of mode because they typically | ||
444 | support only read or write access. See stat(2) for a full list of the | ||
445 | possible mode values. | ||
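The umask step described above amounts to clearing the umask bits from mode. A minimal sketch follows, with a hypothetical helper name; the additional per-file masking is not modelled because the text does not specify which bits each file drops.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: permissions the context directory would get
 * for a requested mode and a given umask value. Only the nine
 * rwx permission bits are considered.
 */
static mode_t spufs_dir_mode(mode_t mode, mode_t mask)
{
    return (mode & ~mask) & (S_IRWXU | S_IRWXG | S_IRWXO);
}
```

For example, a requested mode of 0777 with a umask of 022 yields 0755.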
446 | |||
447 | |||
448 | RETURN VALUE | ||
449 | spu_create returns a new file descriptor. It may return -1 to indicate | ||
450 | an error condition and set errno to one of the error codes listed | ||
451 | below. | ||
452 | |||
453 | |||
454 | ERRORS | ||
455 | EACCES | ||
456 | The current user does not have write access on the spufs mount | ||
457 | point. | ||
458 | |||
459 | EEXIST An SPU context already exists at the given path name. | ||
460 | |||
461 | EFAULT pathname is not a valid string pointer in the current address | ||
462 | space. | ||
463 | |||
464 | EINVAL pathname is not a directory in the spufs mount point. | ||
465 | |||
466 | ELOOP Too many symlinks were found while resolving pathname. | ||
467 | |||
468 | EMFILE The process has reached its maximum open file limit. | ||
469 | |||
470 | ENAMETOOLONG | ||
471 | pathname was too long. | ||
472 | |||
473 | ENFILE The system has reached the global open file limit. | ||
474 | |||
475 | ENOENT Part of pathname could not be resolved. | ||
476 | |||
477 | ENOMEM The kernel could not allocate all resources required. | ||
478 | |||
479 | ENOSPC There are not enough SPU resources available to create a new | ||
480 | context or the user specific limit for the number of SPU con- | ||
481 | texts has been reached. | ||
482 | |||
483 | ENOSYS The functionality is not provided by the current system, because | ||
484 | either the hardware does not provide SPUs or the spufs module is | ||
485 | not loaded. | ||
486 | |||
487 | ENOTDIR | ||
488 | A part of pathname is not a directory. | ||
489 | |||
490 | |||
491 | |||
492 | NOTES | ||
493 | spu_create is meant to be used from libraries that implement a more | ||
494 | abstract interface to SPUs, not to be used from regular applications. | ||
495 | See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- | ||
496 | ommended libraries. | ||
497 | |||
498 | |||
499 | FILES | ||
500 | pathname must point to a location beneath the mount point of spufs. By | ||
501 | convention, it gets mounted in /spu. | ||
502 | |||
503 | |||
504 | CONFORMING TO | ||
505 | This call is Linux specific and only implemented by the ppc64 architec- | ||
506 | ture. Programs using this system call are not portable. | ||
507 | |||
508 | |||
509 | BUGS | ||
510 | The code does not yet fully implement all features outlined here. | ||
511 | |||
512 | |||
513 | AUTHOR | ||
514 | Arnd Bergmann <arndb@de.ibm.com> | ||
515 | |||
516 | SEE ALSO | ||
517 | capabilities(7), close(2), spu_run(2), spufs(7) | ||
518 | |||
519 | |||
520 | |||
521 | Linux 2005-09-28 SPU_CREATE(2) | ||
diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt index 988a62fae11f..7ba2baa165ff 100644 --- a/Documentation/filesystems/sysfs-pci.txt +++ b/Documentation/filesystems/sysfs-pci.txt | |||
@@ -1,4 +1,5 @@ | |||
1 | Accessing PCI device resources through sysfs | 1 | Accessing PCI device resources through sysfs |
2 | -------------------------------------------- | ||
2 | 3 | ||
3 | sysfs, usually mounted at /sys, provides access to PCI resources on platforms | 4 | sysfs, usually mounted at /sys, provides access to PCI resources on platforms |
4 | that support it. For example, a given bus might look like this: | 5 | that support it. For example, a given bus might look like this: |
@@ -47,14 +48,21 @@ files, each with their own function. | |||
47 | binary - file contains binary data | 48 | binary - file contains binary data |
48 | cpumask - file contains a cpumask type | 49 | cpumask - file contains a cpumask type |
49 | 50 | ||
50 | The read only files are informational, writes to them will be ignored. | 51 | The read only files are informational, writes to them will be ignored, with |
51 | Writable files can be used to perform actions on the device (e.g. changing | 52 | the exception of the 'rom' file. Writable files can be used to perform |
52 | config space, detaching a device). mmapable files are available via an | 53 | actions on the device (e.g. changing config space, detaching a device). |
53 | mmap of the file at offset 0 and can be used to do actual device programming | 54 | mmapable files are available via an mmap of the file at offset 0 and can be |
54 | from userspace. Note that some platforms don't support mmapping of certain | 55 | used to do actual device programming from userspace. Note that some platforms |
55 | resources, so be sure to check the return value from any attempted mmap. | 56 | don't support mmapping of certain resources, so be sure to check the return |
57 | value from any attempted mmap. | ||
58 | |||
59 | The 'rom' file is special in that it provides read-only access to the device's | ||
60 | ROM file, if available. It's disabled by default, however, so applications | ||
61 | should write the string "1" to the file to enable it before attempting a read | ||
62 | call, and disable it following the access by writing "0" to the file. | ||
56 | 63 | ||
57 | Accessing legacy resources through sysfs | 64 | Accessing legacy resources through sysfs |
65 | ---------------------------------------- | ||
58 | 66 | ||
59 | Legacy I/O port and ISA memory resources are also provided in sysfs if the | 67 | Legacy I/O port and ISA memory resources are also provided in sysfs if the |
60 | underlying platform supports them. They're located in the PCI class hierarchy, | 68 | underlying platform supports them. They're located in the PCI class hierarchy, |
@@ -75,6 +83,7 @@ simply dereference the returned pointer (after checking for errors of course) | |||
75 | to access legacy memory space. | 83 | to access legacy memory space. |
76 | 84 | ||
77 | Supporting PCI access on new platforms | 85 | Supporting PCI access on new platforms |
86 | -------------------------------------- | ||
78 | 87 | ||
79 | In order to support PCI resource mapping as described above, Linux platform | 88 | In order to support PCI resource mapping as described above, Linux platform |
80 | code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function. | 89 | code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function. |
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index 0d783c504ead..dbe4d87d2615 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt | |||
@@ -78,6 +78,18 @@ use up all the memory on the machine; but enhances the scalability of | |||
78 | that instance in a system with many cpus making intensive use of it. | 78 | that instance in a system with many cpus making intensive use of it. |
79 | 79 | ||
80 | 80 | ||
81 | tmpfs has a mount option to set the NUMA memory allocation policy for | ||
82 | all files in that instance: | ||
83 | mpol=interleave prefers to allocate memory from each node in turn | ||
84 | mpol=default prefers to allocate memory from the local node | ||
85 | mpol=bind prefers to allocate from mpol_nodelist | ||
86 | mpol=preferred prefers to allocate from first node in mpol_nodelist | ||
87 | |||
88 | The following mount option is used in conjunction with mpol=interleave, | ||
89 | mpol=bind or mpol=preferred: | ||
90 | mpol_nodelist: nodelist suitable for parsing with nodelist_parse. | ||
91 | |||
92 | |||
81 | To specify the initial root directory you can use the following mount | 93 | To specify the initial root directory you can use the following mount |
82 | options: | 94 | options: |
83 | 95 | ||
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index ee4c0a8b8db7..e56e842847d3 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt | |||
@@ -162,9 +162,8 @@ get_sb() method fills in is the "s_op" field. This is a pointer to | |||
162 | a "struct super_operations" which describes the next level of the | 162 | a "struct super_operations" which describes the next level of the |
163 | filesystem implementation. | 163 | filesystem implementation. |
164 | 164 | ||
165 | Usually, a filesystem uses generic one of the generic get_sb() | 165 | Usually, a filesystem uses one of the generic get_sb() implementations |
166 | implementations and provides a fill_super() method instead. The | 166 | and provides a fill_super() method instead. The generic methods are: |
167 | generic methods are: | ||
168 | 167 | ||
169 | get_sb_bdev: mount a filesystem residing on a block device | 168 | get_sb_bdev: mount a filesystem residing on a block device |
170 | 169 | ||