author | Linus Torvalds <torvalds@ppc970.osdl.org> | 2005-04-16 18:20:36 -0400 |
---|---|---|
committer | Linus Torvalds <torvalds@ppc970.osdl.org> | 2005-04-16 18:20:36 -0400 |
commit | 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 (patch) | |
tree | 0bba044c4ce775e45a88a51686b5d9f90697ea9d /Documentation/filesystems/ext2.txt |
Linux-2.6.12-rc2 (tag: v2.6.12-rc2)
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!
Diffstat (limited to 'Documentation/filesystems/ext2.txt')
-rw-r--r-- | Documentation/filesystems/ext2.txt | 383 |
1 files changed, 383 insertions, 0 deletions
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
new file mode 100644
index 000000000000..b5cb9110cc6b
--- /dev/null
+++ b/Documentation/filesystems/ext2.txt
@@ -0,0 +1,383 @@ | |||
1 | |||
2 | The Second Extended Filesystem | ||
3 | ============================== | ||
4 | |||
5 | ext2 was originally released in January 1993. Written by Rémy Card, | ||
6 | Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the | ||
7 | Extended Filesystem. It is currently still (April 2001) the predominant | ||
8 | filesystem in use by Linux. There are also implementations available | ||
9 | for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS. | ||
10 | |||
11 | Options | ||
12 | ======= | ||
13 | |||
14 | Most defaults are determined by the filesystem superblock, and can be | ||
15 | set using tune2fs(8). Kernel-determined defaults are indicated by (*). | ||
16 | |||
17 | bsddf               (*) Makes `df' act like BSD. | ||
18 | minixdf                 Makes `df' act like Minix. | ||
19 | |||
20 | check                   Check block and inode bitmaps at mount time | ||
21 |                         (requires CONFIG_EXT2_CHECK). | ||
22 | check=none, nocheck (*) Don't do extra checking of bitmaps on mount | ||
23 |                         (check=normal and check=strict options removed) | ||
24 | |||
25 | debug                   Extra debugging information is sent to the | ||
26 |                         kernel syslog. Useful for developers. | ||
27 | |||
28 | errors=continue         Keep going on a filesystem error. | ||
29 | errors=remount-ro       Remount the filesystem read-only on an error. | ||
30 | errors=panic            Panic and halt the machine if an error occurs. | ||
31 | |||
32 | grpid, bsdgroups        Give objects the same group ID as their parent. | ||
33 | nogrpid, sysvgroups     New objects have the group ID of their creator. | ||
34 | |||
35 | nouid32                 Use 16-bit UIDs and GIDs. | ||
36 | |||
37 | oldalloc                Enable the old block allocator. Orlov should | ||
38 |                         have better performance, we'd like to get some | ||
39 |                         feedback if it's the contrary for you. | ||
40 | orlov               (*) Use the Orlov block allocator. | ||
41 |                         (See http://lwn.net/Articles/14633/ and | ||
42 |                         http://lwn.net/Articles/14446/.) | ||
43 | |||
44 | resuid=n                The user ID which may use the reserved blocks. | ||
45 | resgid=n                The group ID which may use the reserved blocks. | ||
46 | |||
47 | sb=n                    Use alternate superblock at this location. | ||
48 | |||
49 | user_xattr              Enable "user." POSIX Extended Attributes | ||
50 |                         (requires CONFIG_EXT2_FS_XATTR). | ||
51 |                         See also http://acl.bestbits.at | ||
52 | nouser_xattr            Don't support "user." extended attributes. | ||
53 | |||
54 | acl                     Enable POSIX Access Control Lists support | ||
55 |                         (requires CONFIG_EXT2_FS_POSIX_ACL). | ||
56 |                         See also http://acl.bestbits.at | ||
57 | noacl                   Don't support POSIX ACLs. | ||
58 | |||
59 | nobh                    Do not attach buffer_heads to file pagecache. | ||
60 | |||
61 | grpquota,noquota,quota,usrquota  Quota options are silently ignored by ext2. | ||
62 | |||
63 | |||
64 | Specification | ||
65 | ============= | ||
66 | |||
67 | ext2 shares many properties with traditional Unix filesystems. It has | ||
68 | the concepts of blocks, inodes and directories. It has space in the | ||
69 | specification for Access Control Lists (ACLs), fragments, undeletion and | ||
70 | compression though these are not yet implemented (some are available as | ||
71 | separate patches). There is also a versioning mechanism to allow new | ||
72 | features (such as journalling) to be added in a maximally compatible | ||
73 | manner. | ||
74 | |||
75 | Blocks | ||
76 | ------ | ||
77 | |||
78 | The space in the device or file is split up into blocks. These are | ||
79 | a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems), | ||
80 | which is decided when the filesystem is created. Smaller blocks mean | ||
81 | less wasted space per file, but require slightly more accounting overhead, | ||
82 | and also impose other limits on the size of files and the filesystem. | ||
83 | |||
84 | Block Groups | ||
85 | ------------ | ||
86 | |||
87 | Blocks are clustered into block groups in order to reduce fragmentation | ||
88 | and minimise the amount of head seeking when reading a large amount | ||
89 | of consecutive data. Information about each block group is kept in a | ||
90 | descriptor table stored in the block(s) immediately after the superblock. | ||
91 | Two blocks near the start of each group are reserved for the block usage | ||
92 | bitmap and the inode usage bitmap which show which blocks and inodes | ||
93 | are in use. Since each bitmap is limited to a single block, a block group | ||
94 | can hold at most 8 times as many blocks as there are bytes in a block. | ||
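
The bitmap arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation; the function names are made up for illustration and do not correspond to anything in the kernel source.

```python
def max_blocks_per_group(block_size: int) -> int:
    """A one-block bitmap has block_size * 8 bits, one bit per block."""
    return block_size * 8


def max_group_bytes(block_size: int) -> int:
    """Largest block group, in bytes, that a one-block bitmap can describe."""
    return max_blocks_per_group(block_size) * block_size


for size in (1024, 2048, 4096):
    mib = max_group_bytes(size) // (1024 * 1024)
    print(f"{size}-byte blocks: {max_blocks_per_group(size)} blocks, {mib} MiB per group")
```

With 1024-byte blocks a group is therefore limited to 8192 blocks, and with 4096-byte blocks to 32768 blocks.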
95 | |||
96 | The block(s) following the bitmaps in each block group are designated | ||
97 | as the inode table for that block group and the remainder are the data | ||
98 | blocks. The block allocation algorithm attempts to allocate data blocks | ||
99 | in the same block group as the inode which contains them. | ||
100 | |||
101 | The Superblock | ||
102 | -------------- | ||
103 | |||
104 | The superblock contains all the information about the configuration of | ||
105 | the filing system. The primary copy of the superblock is stored at an | ||
106 | offset of 1024 bytes from the start of the device, and it is essential | ||
107 | to mounting the filesystem. Since it is so important, backup copies of | ||
108 | the superblock are stored in block groups throughout the filesystem. | ||
109 | The first version of ext2 (revision 0) stores a copy at the start of | ||
110 | every block group, along with backups of the group descriptor block(s). | ||
111 | Because this can consume a considerable amount of space for large | ||
112 | filesystems, later revisions can optionally reduce the number of backup | ||
113 | copies by only putting backups in specific groups (this is the sparse | ||
114 | superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7. | ||
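
The rule for which groups carry a backup under the sparse superblock feature can be expressed as a small predicate. The sketch below follows the rule as stated above (groups 0, 1 and powers of 3, 5 and 7); the function names are illustrative and are not taken from the kernel or e2fsprogs.

```python
def is_power_of(n: int, base: int) -> bool:
    """True if n == base**k for some integer k >= 0."""
    if n < 1:
        return False
    while n % base == 0:
        n //= base
    return n == 1


def has_superblock_copy(group: int, sparse_super: bool = True) -> bool:
    """Does this block group hold a superblock copy (primary or backup)?"""
    if not sparse_super:
        return True  # revision 0 behaviour: every group has a copy
    if group in (0, 1):
        return True
    return any(is_power_of(group, base) for base in (3, 5, 7))


print([g for g in range(50) if has_superblock_copy(g)])
```

For the first 50 groups this selects 0, 1, 3, 5, 7, 9, 25, 27 and 49.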
115 | |||
116 | The information in the superblock contains fields such as the total | ||
117 | number of inodes and blocks in the filesystem and how many are free, | ||
118 | how many inodes and blocks are in each block group, when the filesystem | ||
119 | was mounted (and if it was cleanly unmounted), when it was modified, | ||
120 | what version of the filesystem it is (see the Revisions section below) | ||
121 | and which OS created it. | ||
122 | |||
123 | If the filesystem is revision 1 or higher, then there are extra fields, | ||
124 | such as a volume name, a unique identification number, the inode size, | ||
125 | and space for optional filesystem features to store configuration info. | ||
126 | |||
127 | All fields in the superblock (as in all other ext2 structures) are stored | ||
128 | on the disc in little endian format, so a filesystem is portable between | ||
129 | machines without having to know what machine it was created on. | ||
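
As a rough illustration of the fixed offset and the little-endian encoding, the sketch below decodes the leading superblock fields from a raw image. The field order, the 0xEF53 magic number and the "1024 << log_block_size" block size rule are assumptions drawn from the commonly documented ext2 on-disk layout, not from this text; check them against the headers in fs/ext2 before relying on them.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # stated above: 1024 bytes from the start of the device
EXT2_MAGIC = 0xEF53        # assumption: ext2 magic number

# "<" selects little-endian with no padding: 13 x u32, then u16, s16, u16.
SUPERBLOCK_HEAD = struct.Struct("<13IHhH")
FIELD_NAMES = (
    "inodes_count", "blocks_count", "r_blocks_count", "free_blocks_count",
    "free_inodes_count", "first_data_block", "log_block_size", "log_frag_size",
    "blocks_per_group", "frags_per_group", "inodes_per_group", "mtime", "wtime",
    "mnt_count", "max_mnt_count", "magic",
)


def parse_superblock(image: bytes) -> dict:
    """Decode the first superblock fields from a device or image dump."""
    values = SUPERBLOCK_HEAD.unpack_from(image, SUPERBLOCK_OFFSET)
    sb = dict(zip(FIELD_NAMES, values))
    if sb["magic"] != EXT2_MAGIC:
        raise ValueError("not an ext2 superblock (bad magic)")
    sb["block_size"] = 1024 << sb["log_block_size"]
    return sb
```

Passing it the first 2048 bytes of a filesystem image (for example, read with open(path, "rb")) is enough for these fields.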
130 | |||
131 | Inodes | ||
132 | ------ | ||
133 | |||
134 | The inode (index node) is a fundamental concept in the ext2 filesystem. | ||
135 | Each object in the filesystem is represented by an inode. The inode | ||
136 | structure contains pointers to the filesystem blocks which contain the | ||
137 | data held in the object and all of the metadata about an object except | ||
138 | its name. The metadata about an object includes the permissions, owner, | ||
139 | group, flags, size, number of blocks used, access time, change time, | ||
140 | modification time, deletion time, number of links, fragments, version | ||
141 | (for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs). | ||
142 | |||
143 | There are some reserved fields which are currently unused in the inode | ||
144 | structure and several which are overloaded. One field is reserved for the | ||
145 | directory ACL if the inode is a directory and alternately for the top 32 | ||
146 | bits of the file size if the inode is a regular file (allowing file sizes | ||
147 | larger than 2GB). The translator field is unused under Linux, but is used | ||
148 | by the HURD to reference the inode of a program which will be used to | ||
149 | interpret this object. Most of the remaining reserved fields have been | ||
150 | used up for both Linux and the HURD for larger owner and group fields. | ||
151 | The HURD also has a larger mode field so it uses another of the remaining | ||
152 | fields to store the extra mode bits. | ||
153 | |||
154 | There are pointers to the first 12 blocks which contain the file's data | ||
155 | in the inode. There is a pointer to an indirect block (which contains | ||
156 | pointers to the next set of blocks), a pointer to a doubly-indirect | ||
157 | block (which contains pointers to indirect blocks) and a pointer to a | ||
158 | trebly-indirect block (which contains pointers to doubly-indirect blocks). | ||
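
The pointer scheme determines how many blocks a file can address. The sketch below assumes 4-byte block pointers (an assumption; the pointer width is not stated in this text) and shows which level of indirection a given logical block falls under. The function names are illustrative.

```python
POINTER_SIZE = 4      # assumption: 32-bit block numbers
DIRECT_BLOCKS = 12    # stated above: 12 direct pointers in the inode


def pointers_per_block(block_size: int) -> int:
    return block_size // POINTER_SIZE


def indirection_level(logical_block: int, block_size: int) -> int:
    """0 = direct, 1 = indirect, 2 = doubly-indirect, 3 = trebly-indirect."""
    if logical_block < 0:
        raise ValueError("negative block number")
    n = pointers_per_block(block_size)
    for level, span in enumerate((DIRECT_BLOCKS, n, n * n, n * n * n)):
        if logical_block < span:
            return level
        logical_block -= span
    raise ValueError("block number beyond the trebly-indirect range")


def addressable_blocks(block_size: int) -> int:
    """Total number of data blocks reachable through all four pointer kinds."""
    n = pointers_per_block(block_size)
    return DIRECT_BLOCKS + n + n * n + n * n * n


print(addressable_blocks(1024) * 1024 // 2**30, "GB with 1kB blocks")
```

Under these assumptions 1kB blocks give roughly 16GB of addressable data, which agrees with the file size limit quoted in the Limitations section. For larger block sizes the pointers could address more than the quoted limits, so other constraints apply there.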
159 | |||
160 | The flags field contains some ext2-specific flags which aren't catered | ||
161 | for by the standard chmod flags. These flags can be listed with lsattr | ||
162 | and changed with the chattr command, and allow specific filesystem | ||
163 | behaviour on a per-file basis. There are flags for secure deletion, | ||
164 | undeletable, compression, synchronous updates, immutability, append-only, | ||
165 | dumpable, no-atime, indexed directories, and data-journaling. Not all | ||
166 | of these are supported yet. | ||
167 | |||
168 | Directories | ||
169 | ----------- | ||
170 | |||
171 | A directory is a filesystem object and has an inode just like a file. | ||
172 | It is a specially formatted file containing records which associate | ||
173 | each name with an inode number. Later revisions of the filesystem also | ||
174 | encode the type of the object (file, directory, symlink, device, fifo, | ||
175 | socket) to avoid the need to check the inode itself for this information | ||
176 | (support for taking advantage of this feature does not yet exist in | ||
177 | Glibc 2.2). | ||
178 | |||
179 | The inode allocation code tries to assign inodes which are in the same | ||
180 | block group as the directory in which they are first created. | ||
181 | |||
182 | The current implementation of ext2 uses a singly-linked list to store | ||
183 | the filenames in the directory; a pending enhancement uses hashing of the | ||
184 | filenames to allow lookup without the need to scan the entire directory. | ||
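
A linear scan of one directory block might look like the sketch below. The record layout used here (32-bit inode number, 16-bit record length, 8-bit name length, 8-bit file type, then the name) is an assumption based on the commonly documented layout for filesystems with the file type feature; it is not spelled out in this text. The record length acts as the "link" to the next entry.

```python
import struct

# Assumed header: inode (u32), rec_len (u16), name_len (u8), file_type (u8)
DIRENT_HEADER = struct.Struct("<IHBB")


def iter_dir_entries(block: bytes):
    """Yield (inode, file_type, name) for each in-use entry in one block."""
    offset = 0
    while offset + DIRENT_HEADER.size <= len(block):
        inode, rec_len, name_len, file_type = DIRENT_HEADER.unpack_from(block, offset)
        if rec_len < DIRENT_HEADER.size:
            break  # corrupt record; stop rather than loop forever
        if inode != 0:  # inode 0 marks an unused entry
            start = offset + DIRENT_HEADER.size
            name = block[start:start + name_len]
            yield inode, file_type, name.decode("utf-8", "replace")
        offset += rec_len  # follow the link to the next record


def make_entry(inode: int, file_type: int, name: bytes, rec_len: int) -> bytes:
    """Helper for building test data; purely illustrative."""
    raw = DIRENT_HEADER.pack(inode, rec_len, len(name), file_type) + name
    return raw.ljust(rec_len, b"\0")
```

Finding a name means walking the records one by one, which is why lookups slow down in very large directories.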
185 | |||
186 | The current implementation never removes empty directory blocks once they | ||
187 | have been allocated to hold more files. | ||
188 | |||
189 | Special files | ||
190 | ------------- | ||
191 | |||
192 | Symbolic links are also filesystem objects with inodes. They deserve | ||
193 | special mention because the data for them is stored within the inode | ||
194 | itself if the symlink is less than 60 bytes long. It uses the fields | ||
195 | which would normally be used to store the pointers to data blocks. | ||
196 | This is a worthwhile optimisation as it avoids allocating a full | ||
197 | block for the symlink, and most symlinks are less than 60 characters long. | ||
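
The 60-byte figure follows from the size of the pointer area in the inode: 12 direct pointers plus 3 indirect pointers, assuming 4 bytes each (the pointer width is an assumption, as it is not stated here). A minimal sketch of the resulting test:

```python
BLOCK_POINTERS = 12 + 3   # direct + indirect, doubly- and trebly-indirect
POINTER_SIZE = 4          # assumption: 32-bit block numbers
INLINE_AREA = BLOCK_POINTERS * POINTER_SIZE


def is_fast_symlink(target: bytes) -> bool:
    """True if the target is short enough to be stored in the inode itself."""
    return len(target) < INLINE_AREA


print(INLINE_AREA, is_fast_symlink(b"/usr/src/linux"))
```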
198 | |||
199 | Character and block special devices never have data blocks assigned to | ||
200 | them. Instead, their device number is stored in the inode, again reusing | ||
201 | the fields which would be used to point to the data blocks. | ||
202 | |||
203 | Reserved Space | ||
204 | -------------- | ||
205 | |||
206 | In ext2, there is a mechanism for reserving a certain number of blocks | ||
207 | for a particular user (normally the super-user). This is intended to | ||
208 | allow for the system to continue functioning even if non-privileged users | ||
209 | fill up all the space available to them (this is independent of filesystem | ||
210 | quotas). It also keeps the filesystem from filling up entirely which | ||
211 | helps combat fragmentation. | ||
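
The effect of the reservation on allocation can be modelled roughly as follows. This is a simplified sketch, not the kernel's actual check: it ignores capabilities and supplementary groups, and it assumes that a resgid of 0 means "no reserved group". The parameter names mirror the resuid= and resgid= mount options.

```python
def may_allocate(free_blocks: int, reserved_blocks: int,
                 uid: int, gid: int,
                 resuid: int = 0, resgid: int = 0) -> bool:
    """Can this user allocate one more block?"""
    if free_blocks <= 0:
        return False                 # filesystem is completely full
    if free_blocks > reserved_blocks:
        return True                  # unreserved space is still available
    # Only the reserved pool is left: restrict it to resuid/resgid.
    return uid == resuid or (resgid != 0 and gid == resgid)
```

For example, with 50 blocks free and 50 reserved, may_allocate(50, 50, uid=1000, gid=1000) is refused while the same request from uid 0 succeeds.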
212 | |||
213 | Filesystem check | ||
214 | ---------------- | ||
215 | |||
216 | At boot time, most systems run a consistency check (e2fsck) on their | ||
217 | filesystems. The superblock of the ext2 filesystem contains several | ||
218 | fields which indicate whether fsck should actually run (since checking | ||
219 | the filesystem at boot can take a long time if it is large). fsck will | ||
220 | run if the filesystem was not cleanly unmounted, if the maximum mount | ||
221 | count has been exceeded or if the maximum time between checks has been | ||
222 | exceeded. | ||
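
The three conditions can be summarised in a short predicate. This is a sketch of the decision as described above, not the e2fsck implementation: the argument names are illustrative, the exact comparison operators are assumptions, and so is the treatment of a non-positive maximum mount count or a zero interval as "disabled".

```python
def needs_check(cleanly_unmounted: bool,
                mount_count: int, max_mount_count: int,
                last_check: int, check_interval: int, now: int) -> bool:
    """Decide whether a full check should run; times are in seconds."""
    if not cleanly_unmounted:
        return True
    if max_mount_count > 0 and mount_count > max_mount_count:
        return True
    if check_interval > 0 and now - last_check > check_interval:
        return True
    return False
```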
223 | |||
224 | Feature Compatibility | ||
225 | --------------------- | ||
226 | |||
227 | The compatibility feature mechanism used in ext2 is sophisticated. | ||
228 | It safely allows features to be added to the filesystem, without | ||
229 | unnecessarily sacrificing compatibility with older versions of the | ||
230 | filesystem code. The feature compatibility mechanism is not supported by | ||
231 | the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in | ||
232 | revision 1. There are three 32-bit fields, one for compatible features | ||
233 | (COMPAT), one for read-only compatible (RO_COMPAT) features and one for | ||
234 | incompatible (INCOMPAT) features. | ||
235 | |||
236 | These feature flags have specific meanings for the kernel as follows: | ||
237 | |||
238 | A COMPAT flag indicates that a feature is present in the filesystem, | ||
239 | but the on-disk format is 100% compatible with older on-disk formats, so | ||
240 | a kernel which didn't know anything about this feature could read/write | ||
241 | the filesystem without any chance of corrupting the filesystem (or even | ||
242 | making it inconsistent). This is essentially just a flag which says | ||
243 | "this filesystem has a (hidden) feature" that the kernel or e2fsck may | ||
244 | want to be aware of (more on e2fsck and feature flags later). The ext3 | ||
245 | HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply | ||
246 | a regular file with data blocks in it so the kernel does not need to | ||
247 | take any special notice of it if it doesn't understand ext3 journaling. | ||
248 | |||
249 | An RO_COMPAT flag indicates that the on-disk format is 100% compatible | ||
250 | with older on-disk formats for reading (i.e. the feature does not change | ||
251 | the visible on-disk format). However, an old kernel writing to such a | ||
252 | filesystem would/could corrupt the filesystem, so this is prevented. The | ||
253 | most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because | ||
254 | sparse groups allow file data blocks where superblock/group descriptor | ||
255 | backups used to live, and ext2_free_blocks() refuses to free these blocks, | ||
256 | which would lead to inconsistent bitmaps. An old kernel would also | ||
257 | get an error if it tried to free a series of blocks which crossed a group | ||
258 | boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem. | ||
259 | |||
260 | An INCOMPAT flag indicates the on-disk format has changed in some | ||
261 | way that makes it unreadable by older kernels, or would otherwise | ||
262 | cause a problem if an old kernel tried to mount it. FILETYPE is an | ||
263 | INCOMPAT flag because older kernels would think a filename was longer | ||
264 | than 256 characters, which would lead to corrupt directory listings. | ||
265 | The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel | ||
266 | doesn't understand compression, you would just get garbage back from | ||
267 | read() instead of it automatically decompressing your data. The ext3 | ||
268 | RECOVER flag is needed to prevent a kernel which does not understand the | ||
269 | ext3 journal from mounting the filesystem without replaying the journal. | ||
270 | |||
271 | e2fsck needs to be stricter in its handling of these flags than | ||
272 | the kernel. If it doesn't understand ANY of the COMPAT, | ||
273 | RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem, | ||
274 | because it has no way of verifying whether a given feature is valid | ||
275 | or not. Allowing e2fsck to succeed on a filesystem with an unknown | ||
276 | feature would give the user a false sense of security. Refusing to check | ||
277 | a filesystem with unknown features is a good incentive for the user to | ||
278 | update to the latest e2fsck. This also means that anyone adding feature | ||
279 | flags to ext2 also needs to update e2fsck to verify these features. | ||
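
The different treatment of the three flag fields by the kernel and by e2fsck can be sketched as two small functions. This models the rules described above and is not the kernel's or e2fsck's code; the function names are hypothetical and the bit values in the example are purely illustrative.

```python
from enum import Enum


class MountDecision(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"
    REFUSE = "refuse"


def kernel_mount_decision(ro_compat: int, incompat: int,
                          known_ro_compat: int, known_incompat: int) -> MountDecision:
    """Unknown COMPAT bits are ignored, so they are not even passed in."""
    if incompat & ~known_incompat:
        return MountDecision.REFUSE       # cannot safely read the filesystem
    if ro_compat & ~known_ro_compat:
        return MountDecision.READ_ONLY    # reading is safe, writing is not
    return MountDecision.READ_WRITE


def e2fsck_will_check(compat: int, ro_compat: int, incompat: int,
                      known_compat: int, known_ro_compat: int,
                      known_incompat: int) -> bool:
    """e2fsck refuses if ANY flag in ANY of the three fields is unknown."""
    unknown = ((compat & ~known_compat)
               | (ro_compat & ~known_ro_compat)
               | (incompat & ~known_incompat))
    return unknown == 0


# Illustrative bit values only: an old kernel that knows no features at all.
print(kernel_mount_decision(ro_compat=0x1, incompat=0x0,
                            known_ro_compat=0x0, known_incompat=0x0))
```

An unknown COMPAT bit thus leaves a kernel mount read-write but is enough to make e2fsck refuse.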
280 | |||
281 | Metadata | ||
282 | -------- | ||
283 | |||
284 | It is frequently claimed that the ext2 implementation of writing | ||
285 | asynchronous metadata is faster than the ffs synchronous metadata | ||
286 | scheme but less reliable. Both methods are equally resolvable by their | ||
287 | respective fsck programs. | ||
288 | |||
289 | If you're exceptionally paranoid, there are 3 ways of making metadata | ||
290 | writes synchronous on ext2: | ||
291 | |||
292 | per-file if you have the program source: use the O_SYNC flag to open() | ||
293 | per-file if you don't have the source: use "chattr +S" on the file | ||
294 | per-filesystem: add the "sync" option to mount (or in /etc/fstab) | ||
295 | |||
296 | The first and last are not ext2 specific but do force the metadata to | ||
297 | be written synchronously. See also Journaling below. | ||
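
For the first of these methods, a program opens the file with O_SYNC. The sketch below does this through Python's os module, which exposes the flag on Unix-like systems; the helper name and the temporary file are only for illustration.

```python
import os
import tempfile


def write_synchronously(path: str, data: bytes) -> int:
    """Write data through a descriptor opened with O_SYNC; return bytes written."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    written = 0
    try:
        while written < len(data):
            # With O_SYNC each write() returns only after the data is on disk.
            written += os.write(fd, data[written:])
    finally:
        os.close(fd)
    return written


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    demo_written = write_synchronously(demo_path, b"hello\n")
    with open(demo_path, "rb") as f:
        demo_contents = f.read()
```

The other two methods need no code changes: "chattr +S" is applied to an existing file, and the "sync" option goes on the mount command line or in /etc/fstab.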
298 | |||
299 | Limitations | ||
300 | ----------- | ||
301 | |||
302 | There are various limits imposed by the on-disk layout of ext2. Other | ||
303 | limits are imposed by the current implementation of the kernel code. | ||
304 | Many of the limits are determined at the time the filesystem is first | ||
305 | created, and depend upon the block size chosen. The ratio of inodes to | ||
306 | data blocks is fixed at filesystem creation time, so the only way to | ||
307 | increase the number of inodes is to increase the size of the filesystem. | ||
308 | No tools currently exist which can change the ratio of inodes to blocks. | ||
309 | |||
310 | Most of these limits could be overcome with slight changes in the on-disk | ||
311 | format and using a compatibility flag to signal the format change (at | ||
312 | the expense of some compatibility). | ||
313 | |||
314 | Filesystem block size:        1kB      2kB      4kB      8kB | ||
315 | |||
316 | File size limit:             16GB    256GB   2048GB   2048GB | ||
317 | Filesystem size limit:     2047GB   8192GB  16384GB  32768GB | ||
318 | |||
319 | There is a 2.4 kernel limit of 2048GB for a single block device, so no | ||
320 | filesystem larger than that can be created at this time. There is also | ||
321 | an upper limit on the block size imposed by the page size of the kernel, | ||
322 | so 8kB blocks are only allowed on Alpha systems (and other architectures | ||
323 | which support larger pages). | ||
324 | |||
325 | There is an upper limit of 32768 subdirectories in a single directory. | ||
326 | |||
327 | There is a "soft" upper limit of about 10-15k files in a single directory | ||
328 | with the current linear linked-list directory implementation. This limit | ||
329 | stems from performance problems when creating and deleting (and also | ||
330 | finding) files in such large directories. Using a hashed directory index | ||
331 | (under development) allows 100k-1M+ files in a single directory without | ||
332 | performance problems (although RAM size becomes an issue at this point). | ||
333 | |||
334 | The (meaningless) absolute upper limit of files in a single directory | ||
335 | (imposed by the file size, the realistic limit is obviously much less) | ||
336 | is over 130 trillion files. It would be higher except there are not | ||
337 | enough 4-character names to make up unique directory entries, so they | ||
338 | have to be 8-character filenames; even then we are fairly close to | ||
339 | running out of unique filenames. | ||
340 | |||
341 | Journaling | ||
342 | ---------- | ||
343 | |||
344 | A journaling extension to the ext2 code has been developed by Stephen | ||
345 | Tweedie. It avoids the risks of metadata corruption and the need to | ||
346 | wait for e2fsck to complete after a crash, without requiring a change | ||
347 | to the on-disk ext2 layout. In a nutshell, the journal is a regular | ||
348 | file which stores whole metadata (and optionally data) blocks that have | ||
349 | been modified, prior to writing them into the filesystem. This means | ||
350 | it is possible to add a journal to an existing ext2 filesystem without | ||
351 | the need for data conversion. | ||
352 | |||
353 | When changes are made to the filesystem (e.g. a file is renamed), they are stored | ||
354 | in a transaction in the journal, which can either be complete or incomplete at | ||
355 | the time of a crash. If a transaction is complete at the time of a crash | ||
356 | (or in the normal case where the system does not crash), then any blocks | ||
357 | in that transaction are guaranteed to represent a valid filesystem state, | ||
358 | and are copied into the filesystem. If a transaction is incomplete at | ||
359 | the time of the crash, then there is no guarantee of consistency for | ||
360 | the blocks in that transaction so they are discarded (which means any | ||
361 | filesystem changes they represent are also lost). | ||
362 | Check Documentation/filesystems/ext3.txt if you want to read more about | ||
363 | ext3 and journaling. | ||
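
The recovery rule can be illustrated with a toy model. This is not the ext3 journal format: the filesystem is a dictionary of block number to contents, each transaction is a dictionary with a "committed" flag, and the sketch assumes the journal is ordered so that replay stops at the first incomplete transaction.

```python
def replay_journal(filesystem: dict, transactions: list) -> dict:
    """Return the filesystem state after replaying the journal."""
    result = dict(filesystem)          # leave the caller's copy untouched
    for txn in transactions:
        if not txn["committed"]:
            break                      # incomplete: discard it and stop
        result.update(txn["blocks"])   # complete: copy its blocks into place
    return result


before = {1: "old inode table", 2: "old directory"}
journal = [
    {"committed": True, "blocks": {1: "new inode table"}},
    {"committed": False, "blocks": {2: "half-written directory"}},
]
print(replay_journal(before, journal))
```

The committed change to block 1 survives the crash, while the incomplete change to block 2 is lost and the old contents remain.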
364 | |||
365 | References | ||
366 | ========== | ||
367 | |||
368 | The kernel source        file:/usr/src/linux/fs/ext2/ | ||
369 | e2fsprogs (e2fsck)       http://e2fsprogs.sourceforge.net/ | ||
370 | Design & Implementation  http://e2fsprogs.sourceforge.net/ext2intro.html | ||
371 | Journaling (ext3)        ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/ | ||
372 | Hashed Directories       http://kernelnewbies.org/~phillips/htree/ | ||
373 | Filesystem Resizing      http://ext2resize.sourceforge.net/ | ||
374 | Compression (*)          http://www.netspace.net.au/~reiter/e2compr/ | ||
375 | |||
376 | Implementations for: | ||
377 | Windows 95/98/NT/2000    http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm | ||
378 | Windows 95 (*)           http://www.yipton.demon.co.uk/content.html#FSDEXT2 | ||
379 | DOS client (*)           ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ | ||
380 | OS/2                     http://perso.wanadoo.fr/matthieu.willm/ext2-os2/ | ||
381 | RISC OS client           ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/ | ||
382 | |||
383 | (*) no longer actively developed/supported (as of Apr 2001) | ||