diff options
author | Rob Landley <rob@landley.net> | 2005-11-07 04:01:09 -0500 |
---|---|---|
committer | Linus Torvalds <torvalds@g5.osdl.org> | 2005-11-07 10:53:56 -0500 |
commit | 7f46a240b0a1797eb641c046d445f026563463d4 (patch) | |
tree | f45cc4bccb355b147b2bbf8b0c329b71466aeed5 | |
parent | cbf8f0f36a2339f87b9dabbbd301ffd86744620c (diff) |
[PATCH] ramfs, rootfs, and initramfs docs
Docs for ramfs, rootfs, and initramfs.
Signed-off-by: Rob Landley <rob@landley.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-rw-r--r-- | Documentation/filesystems/ramfs-rootfs-initramfs.txt | 195 |
1 files changed, 195 insertions, 0 deletions
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt new file mode 100644 index 000000000000..b3404a032596 --- /dev/null +++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt | |||
@@ -0,0 +1,195 @@ | |||
1 | ramfs, rootfs and initramfs | ||
2 | October 17, 2005 | ||
3 | Rob Landley <rob@landley.net> | ||
4 | ============================= | ||
5 | |||
6 | What is ramfs? | ||
7 | -------------- | ||
8 | |||
9 | Ramfs is a very simple filesystem that exports Linux's disk caching | ||
10 | mechanisms (the page cache and dentry cache) as a dynamically resizable | ||
11 | ram-based filesystem. | ||
12 | |||
13 | Normally all files are cached in memory by Linux. Pages of data read from | ||
14 | backing store (usually the block device the filesystem is mounted on) are kept | ||
15 | around in case it's needed again, but marked as clean (freeable) in case the | ||
16 | Virtual Memory system needs the memory for something else. Similarly, data | ||
17 | written to files is marked clean as soon as it has been written to backing | ||
18 | store, but kept around for caching purposes until the VM reallocates the | ||
19 | memory. A similar mechanism (the dentry cache) greatly speeds up access to | ||
20 | directories. | ||
21 | |||
22 | With ramfs, there is no backing store. Files written into ramfs allocate | ||
23 | dentries and page cache as usual, but there's nowhere to write them to. | ||
24 | This means the pages are never marked clean, so they can't be freed by the | ||
25 | VM when it's looking to recycle memory. | ||
26 | |||
27 | The amount of code required to implement ramfs is tiny, because all the | ||
28 | work is done by the existing Linux caching infrastructure. Basically, | ||
29 | you're mounting the disk cache as a filesystem. Because of this, ramfs is not | ||
30 | an optional component removable via menuconfig, since there would be negligible | ||
31 | space savings. | ||
32 | |||
33 | ramfs and ramdisk: | ||
34 | ------------------ | ||
35 | |||
36 | The older "ram disk" mechanism created a synthetic block device out of | ||
37 | an area of ram and used it as backing store for a filesystem. This block | ||
38 | device was of fixed size, so the filesystem mounted on it was of fixed | ||
39 | size. Using a ram disk also required unnecessarily copying memory from the | ||
40 | fake block device into the page cache (and copying changes back out), as well | ||
41 | as creating and destroying dentries. Plus it needed a filesystem driver | ||
42 | (such as ext2) to format and interpret this data. | ||
43 | |||
44 | Compared to ramfs, this wastes memory (and memory bus bandwidth), creates | ||
45 | unnecessary work for the CPU, and pollutes the CPU caches. (There are tricks | ||
46 | to avoid this copying by playing with the page tables, but they're unpleasantly | ||
47 | complicated and turn out to be about as expensive as the copying anyway.) | ||
48 | More to the point, all the work ramfs is doing has to happen _anyway_, | ||
49 | since all file access goes through the page and dentry caches. The ram | ||
50 | disk is simply unnecessary, ramfs is internally much simpler. | ||
51 | |||
52 | Another reason ramdisks are semi-obsolete is that the introduction of | ||
53 | loopback devices offered a more flexible and convenient way to create | ||
54 | synthetic block devices, now from files instead of from chunks of memory. | ||
55 | See losetup (8) for details. | ||
56 | |||
57 | ramfs and tmpfs: | ||
58 | ---------------- | ||
59 | |||
60 | One downside of ramfs is you can keep writing data into it until you fill | ||
61 | up all memory, and the VM can't free it because the VM thinks that files | ||
62 | should get written to backing store (rather than swap space), but ramfs hasn't | ||
63 | got any backing store. Because of this, only root (or a trusted user) should | ||
64 | be allowed write access to a ramfs mount. | ||
65 | |||
66 | A ramfs derivative called tmpfs was created to add size limits, and the ability | ||
67 | to write the data to swap space. Normal users can be allowed write access to | ||
68 | tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. | ||
69 | |||
70 | What is rootfs? | ||
71 | --------------- | ||
72 | |||
73 | Rootfs is a special instance of ramfs, which is always present in 2.6 systems. | ||
74 | (It's used internally as the starting and stopping point for searches of the | ||
75 | kernel's doubly-linked list of mount points.) | ||
76 | |||
77 | Most systems just mount another filesystem over it and ignore it. The | ||
78 | amount of space an empty instance of ramfs takes up is tiny. | ||
79 | |||
80 | What is initramfs? | ||
81 | ------------------ | ||
82 | |||
83 | All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is | ||
84 | extracted into rootfs when the kernel boots up. After extracting, the kernel | ||
85 | checks to see if rootfs contains a file "init", and if so it executes it as PID | ||
86 | 1. If found, this init process is responsible for bringing the system the | ||
87 | rest of the way up, including locating and mounting the real root device (if | ||
88 | any). If rootfs does not contain an init program after the embedded cpio | ||
89 | archive is extracted into it, the kernel will fall through to the older code | ||
90 | to locate and mount a root partition, then exec some variant of /sbin/init | ||
91 | out of that. | ||
92 | |||
93 | All this differs from the old initrd in several ways: | ||
94 | |||
95 | - The old initrd was a separate file, while the initramfs archive is linked | ||
96 | into the linux kernel image. (The directory linux-*/usr is devoted to | ||
97 | generating this archive during the build.) | ||
98 | |||
99 | - The old initrd file was a gzipped filesystem image (in some file format, | ||
100 | such as ext2, that had to be built into the kernel), while the new | ||
101 | initramfs archive is a gzipped cpio archive (like tar only simpler, | ||
102 | see cpio(1) and Documentation/early-userspace/buffer-format.txt). | ||
103 | |||
104 | - The program run by the old initrd (which was called /initrd, not /init) did | ||
105 | some setup and then returned to the kernel, while the init program from | ||
106 | initramfs is not expected to return to the kernel. (If /init needs to hand | ||
107 | off control it can overmount / with a new root device and exec another init | ||
108 | program. See the switch_root utility, below.) | ||
109 | |||
110 | - When switching another root device, initrd would pivot_root and then | ||
111 | umount the ramdisk. But initramfs is rootfs: you can neither pivot_root | ||
112 | rootfs, nor unmount it. Instead delete everything out of rootfs to | ||
113 | free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs | ||
114 | with the new root (cd /newmount; mount --move . /; chroot .), attach | ||
115 | stdin/stdout/stderr to the new /dev/console, and exec the new init. | ||
116 | |||
117 | Since this is a remarkably persnickity process (and involves deleting | ||
118 | commands before you can run them), the klibc package introduced a helper | ||
119 | program (utils/run_init.c) to do all this for you. Most other packages | ||
120 | (such as busybox) have named this command "switch_root". | ||
121 | |||
122 | Populating initramfs: | ||
123 | --------------------- | ||
124 | |||
125 | The 2.6 kernel build process always creates a gzipped cpio format initramfs | ||
126 | archive and links it into the resulting kernel binary. By default, this | ||
127 | archive is empty (consuming 134 bytes on x86). The config option | ||
128 | CONFIG_INITRAMFS_SOURCE (for some reason buried under devices->block devices | ||
129 | in menuconfig, and living in usr/Kconfig) can be used to specify a source for | ||
130 | the initramfs archive, which will automatically be incorporated into the | ||
131 | resulting binary. This option can point to an existing gzipped cpio archive, a | ||
132 | directory containing files to be archived, or a text file specification such | ||
133 | as the following example: | ||
134 | |||
135 | dir /dev 755 0 0 | ||
136 | nod /dev/console 644 0 0 c 5 1 | ||
137 | nod /dev/loop0 644 0 0 b 7 0 | ||
138 | dir /bin 755 1000 1000 | ||
139 | slink /bin/sh busybox 777 0 0 | ||
140 | file /bin/busybox initramfs/busybox 755 0 0 | ||
141 | dir /proc 755 0 0 | ||
142 | dir /sys 755 0 0 | ||
143 | dir /mnt 755 0 0 | ||
144 | file /init initramfs/init.sh 755 0 0 | ||
145 | |||
146 | One advantage of the text file is that root access is not required to | ||
147 | set permissions or create device nodes in the new archive. (Note that those | ||
148 | two example "file" entries expect to find files named "init.sh" and "busybox" in | ||
149 | a directory called "initramfs", under the linux-2.6.* directory. See | ||
150 | Documentation/early-userspace/README for more details.) | ||
151 | |||
152 | If you don't already understand what shared libraries, devices, and paths | ||
153 | you need to get a minimal root filesystem up and running, here are some | ||
154 | references: | ||
155 | http://www.tldp.org/HOWTO/Bootdisk-HOWTO/ | ||
156 | http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html | ||
157 | http://www.linuxfromscratch.org/lfs/view/stable/ | ||
158 | |||
159 | The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is | ||
160 | designed to be a tiny C library to statically link early userspace | ||
161 | code against, along with some related utilities. It is BSD licensed. | ||
162 | |||
163 | I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) | ||
164 | myself. These are LGPL and GPL, respectively. | ||
165 | |||
166 | In theory you could use glibc, but that's not well suited for small embedded | ||
167 | uses like this. (A "hello world" program statically linked against glibc is | ||
168 | over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do | ||
169 | name lookups, even when otherwise statically linked.) | ||
170 | |||
171 | Future directions: | ||
172 | ------------------ | ||
173 | |||
174 | Today (2.6.14), initramfs is always compiled in, but not always used. The | ||
175 | kernel falls back to legacy boot code that is reached only if initramfs does | ||
176 | not contain an /init program. The fallback is legacy code, there to ensure a | ||
177 | smooth transition and allowing early boot functionality to gradually move to | ||
178 | "early userspace" (I.E. initramfs). | ||
179 | |||
180 | The move to early userspace is necessary because finding and mounting the real | ||
181 | root device is complex. Root partitions can span multiple devices (raid or | ||
182 | separate journal). They can be out on the network (requiring dhcp, setting a | ||
183 | specific mac address, logging into a server, etc). They can live on removable | ||
184 | media, with dynamically allocated major/minor numbers and persistent naming | ||
185 | issues requiring a full udev implementation to sort out. They can be | ||
186 | compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned, | ||
187 | and so on. | ||
188 | |||
189 | This kind of complexity (which inevitably includes policy) is rightly handled | ||
190 | in userspace. Both klibc and busybox/uClibc are working on simple initramfs | ||
191 | packages to drop into a kernel build, and when standard solutions are ready | ||
192 | and widely deployed, the kernel's legacy early boot code will become obsolete | ||
193 | and a candidate for the feature removal schedule. | ||
194 | |||
195 | But that's a while off yet. | ||