Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/fuse.txt | 118 | ||||
-rw-r--r-- | Documentation/filesystems/ramfs-rootfs-initramfs.txt | 146 |
2 files changed, 189 insertions, 75 deletions
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt
index 33f74310d16..a584f05403a 100644
--- a/Documentation/filesystems/fuse.txt
+++ b/Documentation/filesystems/fuse.txt
@@ -18,6 +18,14 @@ Non-privileged mount (or user mount): | |||
18 | user. NOTE: this is not the same as mounts allowed with the "user" | 18 | user. NOTE: this is not the same as mounts allowed with the "user" |
19 | option in /etc/fstab, which is not discussed here. | 19 | option in /etc/fstab, which is not discussed here. |
20 | 20 | ||
21 | Filesystem connection: | ||
22 | |||
23 | A connection between the filesystem daemon and the kernel. The | ||
24 | connection exists until either the daemon dies, or the filesystem is | ||
25 | umounted. Note that detaching (or lazy umounting) the filesystem | ||
26 | does _not_ break the connection; in this case it will exist until | ||
27 | the last reference to the filesystem is released. | ||
28 | |||
21 | Mount owner: | 29 | Mount owner: |
22 | 30 | ||
23 | The user who does the mounting. | 31 | The user who does the mounting. |
@@ -86,16 +94,20 @@ Mount options | |||
86 | The default is infinite. Note that the size of read requests is | 94 | The default is infinite. Note that the size of read requests is |
87 | limited anyway to 32 pages (which is 128kbyte on i386). | 95 | limited anyway to 32 pages (which is 128kbyte on i386). |
88 | 96 | ||
89 | Sysfs | 97 | Control filesystem |
90 | ~~~~~ | 98 | ~~~~~~~~~~~~~~~~~~ |
99 | |||
100 | There's a control filesystem for FUSE, which can be mounted by: | ||
91 | 101 | ||
92 | FUSE sets up the following hierarchy in sysfs: | 102 | mount -t fusectl none /sys/fs/fuse/connections |
93 | 103 | ||
94 | /sys/fs/fuse/connections/N/ | 104 | Mounting it under the '/sys/fs/fuse/connections' directory makes it |
105 | backwards compatible with earlier versions. | ||
95 | 106 | ||
96 | where N is an increasing number allocated to each new connection. | 107 | Under the fuse control filesystem each connection has a directory |
108 | named by a unique number. | ||
97 | 109 | ||
98 | For each connection the following attributes are defined: | 110 | For each connection the following files exist within this directory: |
99 | 111 | ||
100 | 'waiting' | 112 | 'waiting' |
101 | 113 | ||
@@ -110,7 +122,47 @@ For each connection the following attributes are defined: | |||
110 | connection. This means that all waiting requests will be aborted and an | 122 | connection. This means that all waiting requests will be aborted and an |
111 | error returned for all aborted and new requests. | 123 | error returned for all aborted and new requests. |
112 | 124 | ||
113 | Only a privileged user may read or write these attributes. | 125 | Only the owner of the mount may read or write these files. |
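
The control filesystem layout described above can be inspected with a few lines of shell. The following is only a sketch: the function name fuse_list_waiting is made up for this example, and it assumes that each connection directory holds a readable 'waiting' file containing a request count. Per the text, only the owner of the mount can read these files.

```shell
#!/bin/sh
# Hypothetical helper (not part of FUSE): print "<connection> <waiting>"
# for every connection directory under a fusectl mount.
# $1 is the mount point; it defaults to the path suggested in the text.
fuse_list_waiting() {
    ctl="${1:-/sys/fs/fuse/connections}"
    for conn in "$ctl"/*/; do
        # Skip the unexpanded glob (no connections) and unreadable entries.
        [ -r "${conn}waiting" ] || continue
        printf '%s %s\n' "$(basename "$conn")" "$(cat "${conn}waiting")"
    done
}
```

Called without arguments it reads the default mount point; passing another directory makes it easy to try against a mock tree.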
126 | |||
127 | Interrupting filesystem operations | ||
128 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
129 | |||
130 | If a process issuing a FUSE filesystem request is interrupted, the | ||
131 | following will happen: | ||
132 | |||
133 | 1) If the request is not yet sent to userspace AND the signal is | ||
134 | fatal (SIGKILL or unhandled fatal signal), then the request is | ||
135 | dequeued and returns immediately. | ||
136 | |||
137 | 2) If the request is not yet sent to userspace AND the signal is not | ||
138 | fatal, then an 'interrupted' flag is set for the request. When | ||
139 | the request has been successfully transferred to userspace and | ||
140 | this flag is set, an INTERRUPT request is queued. | ||
141 | |||
142 | 3) If the request is already sent to userspace, then an INTERRUPT | ||
143 | request is queued. | ||
144 | |||
145 | INTERRUPT requests take precedence over other requests, so the | ||
146 | userspace filesystem will receive queued INTERRUPTs before any others. | ||
147 | |||
148 | The userspace filesystem may ignore the INTERRUPT requests entirely, | ||
149 | or may honor them by sending a reply to the _original_ request, with | ||
150 | the error set to EINTR. | ||
151 | |||
152 | It is also possible that there's a race between processing the | ||
153 | original request and its INTERRUPT request. There are two possibilities: | ||
154 | |||
155 | 1) The INTERRUPT request is processed before the original request is | ||
156 | processed | ||
157 | |||
158 | 2) The INTERRUPT request is processed after the original request has | ||
159 | been answered | ||
160 | |||
161 | If the filesystem cannot find the original request, it should wait for | ||
162 | some timeout and/or a number of new requests to arrive, after which it | ||
163 | should reply to the INTERRUPT request with an EAGAIN error. In case | ||
164 | 1) the INTERRUPT request will be requeued. In case 2) the INTERRUPT | ||
165 | reply will be ignored. | ||
114 | 166 | ||
115 | Aborting a filesystem connection | 167 | Aborting a filesystem connection |
116 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 168 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
@@ -139,8 +191,8 @@ the filesystem. There are several ways to do this: | |||
139 | - Use forced umount (umount -f). Works in all cases but only if | 191 | - Use forced umount (umount -f). Works in all cases but only if |
140 | filesystem is still attached (it hasn't been lazy unmounted) | 192 | filesystem is still attached (it hasn't been lazy unmounted) |
141 | 193 | ||
142 | - Abort filesystem through the sysfs interface. Most powerful | 194 | - Abort filesystem through the FUSE control filesystem. Most |
143 | method, always works. | 195 | powerful method, always works. |
144 | 196 | ||
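
Aborting through the FUSE control filesystem, the last method in the list above, might look like the sketch below. The function name fuse_abort_connection is hypothetical, and the sketch assumes that the control filesystem is mounted at /sys/fs/fuse/connections and that writing any value to a connection's 'abort' file triggers the abort.

```shell
#!/bin/sh
# Hypothetical helper (not part of FUSE): abort one connection.
# $1 is the connection number; $2 is the fusectl mount point and
# defaults to the path used elsewhere in this document.
fuse_abort_connection() {
    conn="$1"
    ctl="${2:-/sys/fs/fuse/connections}"
    if [ ! -w "$ctl/$conn/abort" ]; then
        echo "no writable abort file for connection $conn" >&2
        return 1
    fi
    # Assumption: the written value itself does not matter.
    echo 1 > "$ctl/$conn/abort"
}
```

The writability check matters because only the owner of the mount may write these files, so for anyone else the helper fails with a message instead of a shell redirection error.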
145 | How do non-privileged mounts work? | 197 | How do non-privileged mounts work? |
146 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 198 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
@@ -304,25 +356,7 @@ Scenario 1 - Simple deadlock | |||
304 | | | for "file"] | 356 | | | for "file"] |
305 | | | *DEADLOCK* | 357 | | | *DEADLOCK* |
306 | 358 | ||
307 | The solution for this is to allow requests to be interrupted while | 359 | The solution for this is to allow the filesystem to be aborted. |
308 | they are in userspace: | ||
309 | |||
310 | | [interrupted by signal] | | ||
311 | | <fuse_unlink() | | ||
312 | | [release semaphore] | [semaphore acquired] | ||
313 | | <sys_unlink() | | ||
314 | | | >fuse_unlink() | ||
315 | | | [queue req on fc->pending] | ||
316 | | | [wake up fc->waitq] | ||
317 | | | [sleep on req->waitq] | ||
318 | |||
319 | If the filesystem daemon was single threaded, this will stop here, | ||
320 | since there's no other thread to dequeue and execute the request. | ||
321 | In this case the solution is to kill the FUSE daemon as well. If | ||
322 | there are multiple serving threads, you just have to kill them as | ||
323 | long as any remain. | ||
324 | |||
325 | Moral: a filesystem which deadlocks, can soon find itself dead. | ||
326 | 360 | ||
327 | Scenario 2 - Tricky deadlock | 361 | Scenario 2 - Tricky deadlock |
328 | ---------------------------- | 362 | ---------------------------- |
@@ -355,24 +389,14 @@ but is caused by a pagefault. | |||
355 | | | [lock page] | 389 | | | [lock page] |
356 | | | * DEADLOCK * | 390 | | | * DEADLOCK * |
357 | 391 | ||
358 | Solution is again to let the the request be interrupted (not | 392 | Solution is basically the same as above. |
359 | elaborated further). | ||
360 | |||
361 | An additional problem is that while the write buffer is being | ||
362 | copied to the request, the request must not be interrupted. This | ||
363 | is because the destination address of the copy may not be valid | ||
364 | after the request is interrupted. | ||
365 | |||
366 | This is solved with doing the copy atomically, and allowing | ||
367 | interruption while the page(s) belonging to the write buffer are | ||
368 | faulted with get_user_pages(). The 'req->locked' flag indicates | ||
369 | when the copy is taking place, and interruption is delayed until | ||
370 | this flag is unset. | ||
371 | 393 | ||
372 | Scenario 3 - Tricky deadlock with asynchronous read | 394 | An additional problem is that while the write buffer is being copied |
373 | --------------------------------------------------- | 395 | to the request, the request must not be interrupted/aborted. This is |
396 | because the destination address of the copy may not be valid after the | ||
397 | request has returned. | ||
374 | 398 | ||
375 | The same situation as above, except thread-1 will wait on page lock | 399 | This is solved by doing the copy atomically, and allowing abort |
376 | and hence it will be uninterruptible as well. The solution is to | 400 | while the page(s) belonging to the write buffer are faulted with |
377 | abort the connection with forced umount (if mount is attached) or | 401 | get_user_pages(). The 'req->locked' flag indicates when the copy is |
378 | through the abort attribute in sysfs. | 402 | taking place, and abort is delayed until this flag is unset. |
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
index 60ab61e54e8..25981e2e51b 100644
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
@@ -70,11 +70,13 @@ tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. | |||
70 | What is rootfs? | 70 | What is rootfs? |
71 | --------------- | 71 | --------------- |
72 | 72 | ||
73 | Rootfs is a special instance of ramfs, which is always present in 2.6 systems. | 73 | Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is |
74 | (It's used internally as the starting and stopping point for searches of the | 74 | always present in 2.6 systems. You can't unmount rootfs for approximately the |
75 | kernel's doubly-linked list of mount points.) | 75 | same reason you can't kill the init process; rather than having special code |
76 | to check for and handle an empty list, it's smaller and simpler for the kernel | ||
77 | to just make sure certain lists can't become empty. | ||
76 | 78 | ||
77 | Most systems just mount another filesystem over it and ignore it. The | 79 | Most systems just mount another filesystem over rootfs and ignore it. The |
78 | amount of space an empty instance of ramfs takes up is tiny. | 80 | amount of space an empty instance of ramfs takes up is tiny. |
79 | 81 | ||
80 | What is initramfs? | 82 | What is initramfs? |
@@ -92,14 +94,16 @@ out of that. | |||
92 | 94 | ||
93 | All this differs from the old initrd in several ways: | 95 | All this differs from the old initrd in several ways: |
94 | 96 | ||
95 | - The old initrd was a separate file, while the initramfs archive is linked | 97 | - The old initrd was always a separate file, while the initramfs archive is |
96 | into the linux kernel image. (The directory linux-*/usr is devoted to | 98 | linked into the linux kernel image. (The directory linux-*/usr is devoted |
97 | generating this archive during the build.) | 99 | to generating this archive during the build.) |
98 | 100 | ||
99 | - The old initrd file was a gzipped filesystem image (in some file format, | 101 | - The old initrd file was a gzipped filesystem image (in some file format, |
100 | such as ext2, that had to be built into the kernel), while the new | 102 | such as ext2, that needed a driver built into the kernel), while the new |
101 | initramfs archive is a gzipped cpio archive (like tar only simpler, | 103 | initramfs archive is a gzipped cpio archive (like tar only simpler, |
102 | see cpio(1) and Documentation/early-userspace/buffer-format.txt). | 104 | see cpio(1) and Documentation/early-userspace/buffer-format.txt). The |
105 | kernel's cpio extraction code is not only extremely small, it's also | ||
106 | __init data that can be discarded during the boot process. | ||
103 | 107 | ||
104 | - The program run by the old initrd (which was called /initrd, not /init) did | 108 | - The program run by the old initrd (which was called /initrd, not /init) did |
105 | some setup and then returned to the kernel, while the init program from | 109 | some setup and then returned to the kernel, while the init program from |
@@ -124,13 +128,14 @@ Populating initramfs: | |||
124 | 128 | ||
125 | The 2.6 kernel build process always creates a gzipped cpio format initramfs | 129 | The 2.6 kernel build process always creates a gzipped cpio format initramfs |
126 | archive and links it into the resulting kernel binary. By default, this | 130 | archive and links it into the resulting kernel binary. By default, this |
127 | archive is empty (consuming 134 bytes on x86). The config option | 131 | archive is empty (consuming 134 bytes on x86). |
128 | CONFIG_INITRAMFS_SOURCE (for some reason buried under devices->block devices | 132 | |
129 | in menuconfig, and living in usr/Kconfig) can be used to specify a source for | 133 | The config option CONFIG_INITRAMFS_SOURCE (for some reason buried under |
130 | the initramfs archive, which will automatically be incorporated into the | 134 | devices->block devices in menuconfig, and living in usr/Kconfig) can be used |
131 | resulting binary. This option can point to an existing gzipped cpio archive, a | 135 | to specify a source for the initramfs archive, which will automatically be |
132 | directory containing files to be archived, or a text file specification such | 136 | incorporated into the resulting binary. This option can point to an existing |
133 | as the following example: | 137 | gzipped cpio archive, a directory containing files to be archived, or a text |
138 | file specification such as the following example: | ||
134 | 139 | ||
135 | dir /dev 755 0 0 | 140 | dir /dev 755 0 0 |
136 | nod /dev/console 644 0 0 c 5 1 | 141 | nod /dev/console 644 0 0 c 5 1 |
@@ -146,23 +151,84 @@ as the following example: | |||
146 | Run "usr/gen_init_cpio" (after the kernel build) to get a usage message | 151 | Run "usr/gen_init_cpio" (after the kernel build) to get a usage message |
147 | documenting the above file format. | 152 | documenting the above file format. |
148 | 153 | ||
149 | One advantage of the text file is that root access is not required to | 154 | One advantage of the configuration file is that root access is not required to |
150 | set permissions or create device nodes in the new archive. (Note that those | 155 | set permissions or create device nodes in the new archive. (Note that those |
151 | two example "file" entries expect to find files named "init.sh" and "busybox" in | 156 | two example "file" entries expect to find files named "init.sh" and "busybox" in |
152 | a directory called "initramfs", under the linux-2.6.* directory. See | 157 | a directory called "initramfs", under the linux-2.6.* directory. See |
153 | Documentation/early-userspace/README for more details.) | 158 | Documentation/early-userspace/README for more details.) |
154 | 159 | ||
155 | The kernel does not depend on external cpio tools, gen_init_cpio is created | 160 | The kernel does not depend on external cpio tools. If you specify a |
156 | from usr/gen_init_cpio.c which is entirely self-contained, and the kernel's | 161 | directory instead of a configuration file, the kernel's build infrastructure |
157 | boot-time extractor is also (obviously) self-contained. However, if you _do_ | 162 | creates a configuration file from that directory (usr/Makefile calls |
158 | happen to have cpio installed, the following command line can extract the | 163 | scripts/gen_initramfs_list.sh), and proceeds to package up that directory |
159 | generated cpio image back into its component files: | 164 | using the config file (by feeding it to usr/gen_init_cpio, which is created |
165 | from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is | ||
166 | entirely self-contained, and the kernel's boot-time extractor is also | ||
167 | (obviously) self-contained. | ||
168 | |||
169 | The one thing you might need external cpio utilities installed for is creating | ||
170 | or extracting your own preprepared cpio files to feed to the kernel build | ||
171 | (instead of a config file or directory). | ||
172 | |||
173 | The following command line can extract a cpio image (whether created by the | ||
174 | script below or by the kernel build) back into its component files: | ||
160 | 175 | ||
161 | cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames | 176 | cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames |
162 | 177 | ||
178 | The following shell script can create a prebuilt cpio archive you can | ||
179 | use in place of the above config file: | ||
180 | |||
181 | #!/bin/sh | ||
182 | |||
183 | # Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation. | ||
184 | # Licensed under GPL version 2 | ||
185 | |||
186 | if [ $# -ne 2 ] | ||
187 | then | ||
188 | echo "usage: mkinitramfs directory imagename.cpio.gz" | ||
189 | exit 1 | ||
190 | fi | ||
191 | |||
192 | if [ -d "$1" ] | ||
193 | then | ||
194 | echo "creating $2 from $1" | ||
195 | (cd "$1"; find . | cpio -o -H newc | gzip) > "$2" | ||
196 | else | ||
197 | echo "First argument must be a directory" | ||
198 | exit 1 | ||
199 | fi | ||
200 | |||
201 | Note: The cpio man page contains some bad advice that will break your initramfs | ||
202 | archive if you follow it. It says "A typical way to generate the list | ||
203 | of filenames is with the find command; you should give find the -depth option | ||
204 | to minimize problems with permissions on directories that are unwritable or not | ||
205 | searchable." Don't do this when creating initramfs.cpio.gz images, it won't | ||
206 | work. The Linux kernel cpio extractor won't create files in a directory that | ||
207 | doesn't exist, so the directory entries must go before the files that go in | ||
208 | those directories. The above script gets them in the right order. | ||
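
The ordering problem in the note above is easy to see on a throwaway tree. This sketch uses only find and a temporary directory; the dev/console names are placeholders chosen for the example and no archive is created.

```shell
#!/bin/sh
# Build a tiny scratch tree: one directory containing one file.
tree=$(mktemp -d)
mkdir "$tree/dev"
: > "$tree/dev/console"

# Default order: a directory is listed before the files inside it,
# which is what the kernel's cpio extractor needs.
plain=$(cd "$tree" && find . | tr '\n' ' ')

# With -depth: the files come first and their directory afterwards,
# so the extractor would see dev/console before dev exists.
depth=$(cd "$tree" && find . -depth | tr '\n' ' ')

echo "find .        : $plain"
echo "find . -depth : $depth"
rm -rf "$tree"
```

Feeding the first ordering to "cpio -o -H newc" therefore puts each directory entry ahead of its contents; the second ordering does the opposite.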
209 | |||
210 | External initramfs images: | ||
211 | -------------------------- | ||
212 | |||
213 | If the kernel has initrd support enabled, an external cpio.gz archive can also | ||
214 | be passed into a 2.6 kernel in place of an initrd. In this case, the kernel | ||
215 | will autodetect the type (initramfs, not initrd) and extract the external cpio | ||
216 | archive into rootfs before trying to run /init. | ||
217 | |||
218 | This has the memory efficiency advantages of initramfs (no ramdisk block | ||
219 | device) but the separate packaging of initrd (which is nice if you have | ||
220 | non-GPL code you'd like to run from initramfs, without conflating it with | ||
221 | the GPL licensed Linux kernel binary). | ||
222 | |||
223 | It can also be used to supplement the kernel's built-in initramfs image. The | ||
224 | files in the external archive will overwrite any conflicting files in | ||
225 | the built-in initramfs archive. Some distributors also prefer to customize | ||
226 | a single kernel image with task-specific initramfs images, without recompiling. | ||
227 | |||
163 | Contents of initramfs: | 228 | Contents of initramfs: |
164 | ---------------------- | 229 | ---------------------- |
165 | 230 | ||
231 | An initramfs archive is a complete self-contained root filesystem for Linux. | ||
166 | If you don't already understand what shared libraries, devices, and paths | 232 | If you don't already understand what shared libraries, devices, and paths |
167 | you need to get a minimal root filesystem up and running, here are some | 233 | you need to get a minimal root filesystem up and running, here are some |
168 | references: | 234 | references: |
@@ -176,13 +242,36 @@ code against, along with some related utilities. It is BSD licensed. | |||
176 | 242 | ||
177 | I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) | 243 | I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) |
178 | myself. These are LGPL and GPL, respectively. (A self-contained initramfs | 244 | myself. These are LGPL and GPL, respectively. (A self-contained initramfs |
179 | package is planned for the busybox 1.2 release.) | 245 | package is planned for the busybox 1.3 release.) |
180 | 246 | ||
181 | In theory you could use glibc, but that's not well suited for small embedded | 247 | In theory you could use glibc, but that's not well suited for small embedded |
182 | uses like this. (A "hello world" program statically linked against glibc is | 248 | uses like this. (A "hello world" program statically linked against glibc is |
183 | over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do | 249 | over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do |
184 | name lookups, even when otherwise statically linked.) | 250 | name lookups, even when otherwise statically linked.) |
185 | 251 | ||
252 | A good first step is to get initramfs to run a statically linked "hello world" | ||
253 | program as init, and test it under an emulator like qemu (www.qemu.org) or | ||
254 | User Mode Linux, like so: | ||
255 | |||
256 | cat > hello.c << EOF | ||
257 | #include <stdio.h> | ||
258 | #include <unistd.h> | ||
259 | |||
260 | int main(int argc, char *argv[]) | ||
261 | { | ||
262 | printf("Hello world!\n"); | ||
263 | sleep(999999999); | ||
264 | } | ||
265 | EOF | ||
266 | gcc -static hello.c -o init | ||
267 | echo init | cpio -o -H newc | gzip > test.cpio.gz | ||
268 | # Testing external initramfs using the initrd loading mechanism. | ||
269 | qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero | ||
270 | |||
271 | When debugging a normal root filesystem, it's nice to be able to boot with | ||
272 | "init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's | ||
273 | just as useful. | ||
274 | |||
186 | Why cpio rather than tar? | 275 | Why cpio rather than tar? |
187 | ------------------------- | 276 | ------------------------- |
188 | 277 | ||
@@ -241,7 +330,7 @@ the above threads) is: | |||
241 | Future directions: | 330 | Future directions: |
242 | ------------------ | 331 | ------------------ |
243 | 332 | ||
244 | Today (2.6.14), initramfs is always compiled in, but not always used. The | 333 | Today (2.6.16), initramfs is always compiled in, but not always used. The |
245 | kernel falls back to legacy boot code that is reached only if initramfs does | 334 | kernel falls back to legacy boot code that is reached only if initramfs does |
246 | not contain an /init program. The fallback is legacy code, there to ensure a | 335 | not contain an /init program. The fallback is legacy code, there to ensure a |
247 | smooth transition and allowing early boot functionality to gradually move to | 336 | smooth transition and allowing early boot functionality to gradually move to |
@@ -258,8 +347,9 @@ and so on. | |||
258 | 347 | ||
259 | This kind of complexity (which inevitably includes policy) is rightly handled | 348 | This kind of complexity (which inevitably includes policy) is rightly handled |
260 | in userspace. Both klibc and busybox/uClibc are working on simple initramfs | 349 | in userspace. Both klibc and busybox/uClibc are working on simple initramfs |
261 | packages to drop into a kernel build, and when standard solutions are ready | 350 | packages to drop into a kernel build. |
262 | and widely deployed, the kernel's legacy early boot code will become obsolete | ||
263 | and a candidate for the feature removal schedule. | ||
264 | 351 | ||
265 | But that's a while off yet. | 352 | The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree. |
353 | The kernel's current early boot code (partition detection, etc) will probably | ||
354 | be migrated into a default initramfs, automatically created and used by the | ||
355 | kernel build. | ||