Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- Documentation/filesystems/fuse.txt 118
-rw-r--r-- Documentation/filesystems/ramfs-rootfs-initramfs.txt 146
2 files changed, 189 insertions, 75 deletions
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt
index 33f74310d161..a584f05403a4 100644
--- a/Documentation/filesystems/fuse.txt
+++ b/Documentation/filesystems/fuse.txt
@@ -18,6 +18,14 @@ Non-privileged mount (or user mount):
18 user. NOTE: this is not the same as mounts allowed with the "user" 18 user. NOTE: this is not the same as mounts allowed with the "user"
19 option in /etc/fstab, which is not discussed here. 19 option in /etc/fstab, which is not discussed here.
20 20
21Filesystem connection:
22
23 A connection between the filesystem daemon and the kernel. The
24 connection exists until either the daemon dies, or the filesystem is
25 umounted. Note that detaching (or lazy umounting) the filesystem
26 does _not_ break the connection, in this case it will exist until
27 the last reference to the filesystem is released.
28
21Mount owner: 29Mount owner:
22 30
23 The user who does the mounting. 31 The user who does the mounting.
@@ -86,16 +94,20 @@ Mount options
86 The default is infinite. Note that the size of read requests is 94 The default is infinite. Note that the size of read requests is
87 limited anyway to 32 pages (which is 128kbyte on i386). 95 limited anyway to 32 pages (which is 128kbyte on i386).
88 96
89Sysfs 97Control filesystem
90~~~~~ 98~~~~~~~~~~~~~~~~~~
99
100There's a control filesystem for FUSE, which can be mounted by:
91 101
92FUSE sets up the following hierarchy in sysfs: 102 mount -t fusectl none /sys/fs/fuse/connections
93 103
94 /sys/fs/fuse/connections/N/ 104Mounting it under the '/sys/fs/fuse/connections' directory makes it
105backwards compatible with earlier versions.
95 106
96where N is an increasing number allocated to each new connection. 107Under the fuse control filesystem each connection has a directory
108named by a unique number.
97 109
98For each connection the following attributes are defined: 110For each connection the following files exist within this directory:
99 111
100 'waiting' 112 'waiting'
101 113
@@ -110,7 +122,47 @@ For each connection the following attributes are defined:
110 connection. This means that all waiting requests will be aborted and an 122 connection. This means that all waiting requests will be aborted and an
111 error returned for all aborted and new requests. 123 error returned for all aborted and new requests.
112 124
113Only a privileged user may read or write these attributes. 125Only the owner of the mount may read or write these files.
126
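As a rough illustration of the files described above, the following shell
sketch prints the 'waiting' count of each connection and shows how the 'abort'
file might be written. The function names and the optional base directory
argument are assumptions made for this example and are not part of FUSE. It
also assumes the control filesystem is mounted at /sys/fs/fuse/connections and
that the caller owns the mounts in question.

```shell
#!/bin/sh

# Sketch: print "<connection-number> <waiting-count>" for every connection.
# The base directory is a parameter (hypothetical) so that the function can
# be tried against a mock directory tree.
fuse_list_waiting() {
    base="${1:-/sys/fs/fuse/connections}"
    for dir in "$base"/*/; do
        # Skips the unexpanded glob when there are no connections.
        [ -r "${dir}waiting" ] || continue
        conn=$(basename "$dir")
        echo "$conn $(cat "${dir}waiting")"
    done
}

# Sketch: abort connection $1. Writing anything into the 'abort' file
# aborts the connection, so the value written here is arbitrary.
fuse_abort() {
    base="${2:-/sys/fs/fuse/connections}"
    echo 1 > "$base/$1/abort"
}

# Possible usage (not executed here):
#   fuse_list_waiting
#   fuse_abort 42
```

The connection number passed to fuse_abort is whatever directory name the
control filesystem shows for the connection. The 42 above is only a
placeholder.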
127Interrupting filesystem operations
128~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
129
130If a process issuing a FUSE filesystem request is interrupted, the
131following will happen:
132
133 1) If the request is not yet sent to userspace AND the signal is
134 fatal (SIGKILL or unhandled fatal signal), then the request is
135 dequeued and returns immediately.
136
137 2) If the request is not yet sent to userspace AND the signal is not
138 fatal, then an 'interrupted' flag is set for the request. When
139 the request has been successfully transferred to userspace and
140 this flag is set, an INTERRUPT request is queued.
141
142 3) If the request is already sent to userspace, then an INTERRUPT
143 request is queued.
144
145INTERRUPT requests take precedence over other requests, so the
146userspace filesystem will receive queued INTERRUPTs before any others.
147
148The userspace filesystem may ignore the INTERRUPT requests entirely,
149or may honor them by sending a reply to the _original_ request, with
150the error set to EINTR.
151
152It is also possible that there's a race between processing the
153original request and its INTERRUPT request. There are two possibilities:
154
155 1) The INTERRUPT request is processed before the original request is
156 processed
157
158 2) The INTERRUPT request is processed after the original request has
159 been answered
160
161If the filesystem cannot find the original request, it should wait for
162some timeout and/or a number of new requests to arrive, after which it
163should reply to the INTERRUPT request with an EAGAIN error. In
164case 1) the INTERRUPT request will be requeued. In case 2) the
165INTERRUPT reply will be ignored.
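
The decision a userspace filesystem makes for each INTERRUPT can be sketched
as follows. This is only a model written in shell. The 'pending' list, the
request numbers and the function name are hypothetical bookkeeping invented
for the example, and the wait for a timeout and/or further requests is
reduced to a comment.

```shell
#!/bin/sh

# Hypothetical bookkeeping: unique IDs of requests the daemon has received
# but not yet answered.
pending="101 102 105"

# handle_interrupt ID: decide how to answer an INTERRUPT aimed at request ID.
handle_interrupt() {
    for id in $pending; do
        if [ "$id" = "$1" ]; then
            # Original request found: honor the interrupt by answering the
            # _original_ request with EINTR.
            echo "reply to request $1: EINTR"
            return 0
        fi
    done
    # Original request not found. A real daemon would first wait for some
    # timeout and/or a number of new requests before giving up.
    echo "reply to INTERRUPT for $1: EAGAIN"
}

# Possible usage:
#   handle_interrupt 102
#   handle_interrupt 999
```

A filesystem that ignores INTERRUPT requests entirely needs none of this,
which the text above explicitly allows.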
114 166
115Aborting a filesystem connection 167Aborting a filesystem connection
116~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 168~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -139,8 +191,8 @@ the filesystem. There are several ways to do this:
139 - Use forced umount (umount -f). Works in all cases but only if 191 - Use forced umount (umount -f). Works in all cases but only if
140 filesystem is still attached (it hasn't been lazy unmounted) 192 filesystem is still attached (it hasn't been lazy unmounted)
141 193
142 - Abort filesystem through the sysfs interface. Most powerful 194 - Abort filesystem through the FUSE control filesystem. Most
143 method, always works. 195 powerful method, always works.
144 196
145How do non-privileged mounts work? 197How do non-privileged mounts work?
146~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 198~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -304,25 +356,7 @@ Scenario 1 - Simple deadlock
304 | | for "file"] 356 | | for "file"]
305 | | *DEADLOCK* 357 | | *DEADLOCK*
306 358
307The solution for this is to allow requests to be interrupted while 359The solution for this is to allow the filesystem to be aborted.
308they are in userspace:
309
310 | [interrupted by signal] |
311 | <fuse_unlink() |
312 | [release semaphore] | [semaphore acquired]
313 | <sys_unlink() |
314 | | >fuse_unlink()
315 | | [queue req on fc->pending]
316 | | [wake up fc->waitq]
317 | | [sleep on req->waitq]
318
319If the filesystem daemon was single threaded, this will stop here,
320since there's no other thread to dequeue and execute the request.
321In this case the solution is to kill the FUSE daemon as well. If
322there are multiple serving threads, you just have to kill them as
323long as any remain.
324
325Moral: a filesystem which deadlocks, can soon find itself dead.
326 360
327Scenario 2 - Tricky deadlock 361Scenario 2 - Tricky deadlock
328---------------------------- 362----------------------------
@@ -355,24 +389,14 @@ but is caused by a pagefault.
355 | | [lock page] 389 | | [lock page]
356 | | * DEADLOCK * 390 | | * DEADLOCK *
357 391
358Solution is again to let the the request be interrupted (not 392Solution is basically the same as above.
359elaborated further).
360
361An additional problem is that while the write buffer is being
362copied to the request, the request must not be interrupted. This
363is because the destination address of the copy may not be valid
364after the request is interrupted.
365
366This is solved with doing the copy atomically, and allowing
367interruption while the page(s) belonging to the write buffer are
368faulted with get_user_pages(). The 'req->locked' flag indicates
369when the copy is taking place, and interruption is delayed until
370this flag is unset.
371 393
372Scenario 3 - Tricky deadlock with asynchronous read 394An additional problem is that while the write buffer is being copied
373--------------------------------------------------- 395to the request, the request must not be interrupted/aborted. This is
396because the destination address of the copy may not be valid after the
397request has returned.
374 398
375The same situation as above, except thread-1 will wait on page lock 399This is solved with doing the copy atomically, and allowing abort
376and hence it will be uninterruptible as well. The solution is to 400while the page(s) belonging to the write buffer are faulted with
377abort the connection with forced umount (if mount is attached) or 401get_user_pages(). The 'req->locked' flag indicates when the copy is
378through the abort attribute in sysfs. 402taking place, and abort is delayed until this flag is unset.
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
index 60ab61e54e8a..25981e2e51be 100644
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
@@ -70,11 +70,13 @@ tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information.
70What is rootfs? 70What is rootfs?
71--------------- 71---------------
72 72
73Rootfs is a special instance of ramfs, which is always present in 2.6 systems. 73Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is
74(It's used internally as the starting and stopping point for searches of the 74always present in 2.6 systems. You can't unmount rootfs for approximately the
75kernel's doubly-linked list of mount points.) 75same reason you can't kill the init process; rather than having special code
76to check for and handle an empty list, it's smaller and simpler for the kernel
77to just make sure certain lists can't become empty.
76 78
77Most systems just mount another filesystem over it and ignore it. The 79Most systems just mount another filesystem over rootfs and ignore it. The
78amount of space an empty instance of ramfs takes up is tiny. 80amount of space an empty instance of ramfs takes up is tiny.
79 81
80What is initramfs? 82What is initramfs?
@@ -92,14 +94,16 @@ out of that.
92 94
93All this differs from the old initrd in several ways: 95All this differs from the old initrd in several ways:
94 96
95 - The old initrd was a separate file, while the initramfs archive is linked 97 - The old initrd was always a separate file, while the initramfs archive is
96 into the linux kernel image. (The directory linux-*/usr is devoted to 98 linked into the linux kernel image. (The directory linux-*/usr is devoted
97 generating this archive during the build.) 99 to generating this archive during the build.)
98 100
99 - The old initrd file was a gzipped filesystem image (in some file format, 101 - The old initrd file was a gzipped filesystem image (in some file format,
100 such as ext2, that had to be built into the kernel), while the new 102 such as ext2, that needed a driver built into the kernel), while the new
101 initramfs archive is a gzipped cpio archive (like tar only simpler, 103 initramfs archive is a gzipped cpio archive (like tar only simpler,
102 see cpio(1) and Documentation/early-userspace/buffer-format.txt). 104 see cpio(1) and Documentation/early-userspace/buffer-format.txt). The
105 kernel's cpio extraction code is not only extremely small, it's also
106 __init data that can be discarded during the boot process.
103 107
104 - The program run by the old initrd (which was called /initrd, not /init) did 108 - The program run by the old initrd (which was called /initrd, not /init) did
105 some setup and then returned to the kernel, while the init program from 109 some setup and then returned to the kernel, while the init program from
@@ -124,13 +128,14 @@ Populating initramfs:
124 128
125The 2.6 kernel build process always creates a gzipped cpio format initramfs 129The 2.6 kernel build process always creates a gzipped cpio format initramfs
126archive and links it into the resulting kernel binary. By default, this 130archive and links it into the resulting kernel binary. By default, this
127archive is empty (consuming 134 bytes on x86). The config option 131archive is empty (consuming 134 bytes on x86).
128CONFIG_INITRAMFS_SOURCE (for some reason buried under devices->block devices 132
129in menuconfig, and living in usr/Kconfig) can be used to specify a source for 133The config option CONFIG_INITRAMFS_SOURCE (for some reason buried under
130the initramfs archive, which will automatically be incorporated into the 134devices->block devices in menuconfig, and living in usr/Kconfig) can be used
131resulting binary. This option can point to an existing gzipped cpio archive, a 135to specify a source for the initramfs archive, which will automatically be
132directory containing files to be archived, or a text file specification such 136incorporated into the resulting binary. This option can point to an existing
133as the following example: 137gzipped cpio archive, a directory containing files to be archived, or a text
138file specification such as the following example:
134 139
135 dir /dev 755 0 0 140 dir /dev 755 0 0
136 nod /dev/console 644 0 0 c 5 1 141 nod /dev/console 644 0 0 c 5 1
@@ -146,23 +151,84 @@ as the following example:
146Run "usr/gen_init_cpio" (after the kernel build) to get a usage message 151Run "usr/gen_init_cpio" (after the kernel build) to get a usage message
147documenting the above file format. 152documenting the above file format.
148 153
149One advantage of the text file is that root access is not required to 154One advantage of the configuration file is that root access is not required to
150set permissions or create device nodes in the new archive. (Note that those 155set permissions or create device nodes in the new archive. (Note that those
151two example "file" entries expect to find files named "init.sh" and "busybox" in 156two example "file" entries expect to find files named "init.sh" and "busybox" in
152a directory called "initramfs", under the linux-2.6.* directory. See 157a directory called "initramfs", under the linux-2.6.* directory. See
153Documentation/early-userspace/README for more details.) 158Documentation/early-userspace/README for more details.)
154 159
155The kernel does not depend on external cpio tools, gen_init_cpio is created 160The kernel does not depend on external cpio tools. If you specify a
156from usr/gen_init_cpio.c which is entirely self-contained, and the kernel's 161directory instead of a configuration file, the kernel's build infrastructure
157boot-time extractor is also (obviously) self-contained. However, if you _do_ 162creates a configuration file from that directory (usr/Makefile calls
158happen to have cpio installed, the following command line can extract the 163scripts/gen_initramfs_list.sh), and proceeds to package up that directory
159generated cpio image back into its component files: 164using the config file (by feeding it to usr/gen_init_cpio, which is created
165from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is
166entirely self-contained, and the kernel's boot-time extractor is also
167(obviously) self-contained.
168
169The one thing you might need external cpio utilities installed for is creating
170or extracting your own preprepared cpio files to feed to the kernel build
171(instead of a config file or directory).
172
173The following command line can extract a cpio image (created either by the
174script below or by the kernel build) back into its component files:
160 175
161 cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames 176 cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
162 177
178The following shell script can create a prebuilt cpio archive you can
179use in place of the above config file:
180
181 #!/bin/sh
182
183 # Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation.
184 # Licensed under GPL version 2
185
186 if [ $# -ne 2 ]
187 then
188 echo "usage: mkinitramfs directory imagename.cpio.gz"
189 exit 1
190 fi
191
192 if [ -d "$1" ]
193 then
194 echo "creating $2 from $1"
195 (cd "$1"; find . | cpio -o -H newc | gzip) > "$2"
196 else
197 echo "First argument must be a directory"
198 exit 1
199 fi
200
201Note: The cpio man page contains some bad advice that will break your initramfs
202archive if you follow it. It says "A typical way to generate the list
203of filenames is with the find command; you should give find the -depth option
204to minimize problems with permissions on directories that are unwritable or not
205searchable." Don't do this when creating initramfs.cpio.gz images, it won't
206work. The Linux kernel cpio extractor won't create files in a directory that
207doesn't exist, so the directory entries must go before the files that go in
208those directories. The above script gets them in the right order.
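
To make the ordering requirement concrete, here is a small shell sketch that
reads a file list (as produced by find) and reports the first entry that is
listed before its parent directory. The function name check_cpio_order is an
assumption for this example and is not a kernel or cpio tool. The sketch also
assumes that paths contain no spaces or newlines.

```shell
#!/bin/sh

# check_cpio_order: read a list of paths on stdin, one per line, and fail
# if any entry appears before its parent directory.
check_cpio_order() {
    # "." is treated as already present, since it is the archive root.
    seen=" . "
    while IFS= read -r path; do
        parent=$(dirname "$path")
        case "$seen" in
            *" $parent "*) ;;
            *)  echo "out of order: $path listed before $parent"
                return 1 ;;
        esac
        seen="$seen$path "
    done
    return 0
}

# Possible usage, run from the directory to be archived:
#   find . | check_cpio_order           # directories first, succeeds
#   find . -depth | check_cpio_order    # contents first, fails
```

Plain "find ." lists each directory before its contents, which is the order
the kernel's extractor needs. With -depth the contents come first, so the
check fails on the first file inside a subdirectory.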
209
210External initramfs images:
211--------------------------
212
213If the kernel has initrd support enabled, an external cpio.gz archive can also
214be passed into a 2.6 kernel in place of an initrd. In this case, the kernel
215will autodetect the type (initramfs, not initrd) and extract the external cpio
216archive into rootfs before trying to run /init.
217
218This has the memory efficiency advantages of initramfs (no ramdisk block
219device) but the separate packaging of initrd (which is nice if you have
220non-GPL code you'd like to run from initramfs, without conflating it with
221the GPL licensed Linux kernel binary).
222
223It can also be used to supplement the kernel's built-in initramfs image. The
224files in the external archive will overwrite any conflicting files in
225the built-in initramfs archive. Some distributors also prefer to customize
226a single kernel image with task-specific initramfs images, without recompiling.
227
163Contents of initramfs: 228Contents of initramfs:
164---------------------- 229----------------------
165 230
231An initramfs archive is a complete self-contained root filesystem for Linux.
166If you don't already understand what shared libraries, devices, and paths 232If you don't already understand what shared libraries, devices, and paths
167you need to get a minimal root filesystem up and running, here are some 233you need to get a minimal root filesystem up and running, here are some
168references: 234references:
@@ -176,13 +242,36 @@ code against, along with some related utilities. It is BSD licensed.
176 242
177I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) 243I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
178myself. These are LGPL and GPL, respectively. (A self-contained initramfs 244myself. These are LGPL and GPL, respectively. (A self-contained initramfs
179package is planned for the busybox 1.2 release.) 245package is planned for the busybox 1.3 release.)
180 246
181In theory you could use glibc, but that's not well suited for small embedded 247In theory you could use glibc, but that's not well suited for small embedded
182uses like this. (A "hello world" program statically linked against glibc is 248uses like this. (A "hello world" program statically linked against glibc is
183over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do 249over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do
184name lookups, even when otherwise statically linked.) 250name lookups, even when otherwise statically linked.)
185 251
252A good first step is to get initramfs to run a statically linked "hello world"
253program as init, and test it under an emulator like qemu (www.qemu.org) or
254User Mode Linux, like so:
255
256 cat > hello.c << EOF
257 #include <stdio.h>
258 #include <unistd.h>
259
260 int main(int argc, char *argv[])
261 {
262 printf("Hello world!\n");
263 sleep(999999999);
264 }
265 EOF
266 gcc -static hello.c -o init
267 echo init | cpio -o -H newc | gzip > test.cpio.gz
268 # Testing external initramfs using the initrd loading mechanism.
269 qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero
270
271When debugging a normal root filesystem, it's nice to be able to boot with
272"init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's
273just as useful.
274
186Why cpio rather than tar? 275Why cpio rather than tar?
187------------------------- 276-------------------------
188 277
@@ -241,7 +330,7 @@ the above threads) is:
241Future directions: 330Future directions:
242------------------ 331------------------
243 332
244Today (2.6.14), initramfs is always compiled in, but not always used. The 333Today (2.6.16), initramfs is always compiled in, but not always used. The
245kernel falls back to legacy boot code that is reached only if initramfs does 334kernel falls back to legacy boot code that is reached only if initramfs does
246not contain an /init program. The fallback is legacy code, there to ensure a 335not contain an /init program. The fallback is legacy code, there to ensure a
247smooth transition and allowing early boot functionality to gradually move to 336smooth transition and allowing early boot functionality to gradually move to
@@ -258,8 +347,9 @@ and so on.
258 347
259This kind of complexity (which inevitably includes policy) is rightly handled 348This kind of complexity (which inevitably includes policy) is rightly handled
260in userspace. Both klibc and busybox/uClibc are working on simple initramfs 349in userspace. Both klibc and busybox/uClibc are working on simple initramfs
261packages to drop into a kernel build, and when standard solutions are ready 350packages to drop into a kernel build.
262and widely deployed, the kernel's legacy early boot code will become obsolete
263and a candidate for the feature removal schedule.
264 351
265But that's a while off yet. 352The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree.
353The kernel's current early boot code (partition detection, etc) will probably
354be migrated into a default initramfs, automatically created and used by the
355kernel build.