| -rw-r--r-- | Documentation/filesystems/squashfs.txt | 225 |
1 files changed, 225 insertions, 0 deletions
diff --git a/Documentation/filesystems/squashfs.txt b/Documentation/filesystems/squashfs.txt new file mode 100644 index 000000000000..3e79e4a7a392 --- /dev/null +++ b/Documentation/filesystems/squashfs.txt | |||
| @@ -0,0 +1,225 @@ | |||
| 1 | SQUASHFS 4.0 FILESYSTEM | ||
| 2 | ======================= | ||
| 3 | |||
| 4 | Squashfs is a compressed read-only filesystem for Linux. | ||
| 5 | It uses zlib compression to compress files, inodes and directories. | ||
| 6 | Inodes in the system are very small and all blocks are packed to minimise | ||
| 7 | data overhead. Block sizes greater than 4 KiB are supported, up to a | ||
| 8 | maximum of 1 MiB (default block size 128 KiB). | ||
| 9 | |||
| 10 | Squashfs is intended for general read-only filesystem use, for archival | ||
| 11 | use (i.e. in cases where a .tar.gz file may be used), and in constrained | ||
| 12 | block device/memory systems (e.g. embedded systems) where low overhead is | ||
| 13 | needed. | ||
| 14 | |||
| 15 | Mailing list: squashfs-devel@lists.sourceforge.net | ||
| 16 | Web site: www.squashfs.org | ||
| 17 | |||
| 18 | 1. FILESYSTEM FEATURES | ||
| 19 | ---------------------- | ||
| 20 | |||
| 21 | Squashfs filesystem features versus Cramfs: | ||
| 22 | |||
| 23 |                                 Squashfs        Cramfs | ||
| 24 | |||
| 25 | Max filesystem size:            2^64            256 MiB | ||
| 26 | Max file size:                  ~ 2 TiB         16 MiB | ||
| 27 | Max files:                      unlimited       unlimited | ||
| 28 | Max directories:                unlimited       unlimited | ||
| 29 | Max entries per directory:      unlimited       unlimited | ||
| 30 | Max block size:                 1 MiB           4 KiB | ||
| 31 | Metadata compression:           yes             no | ||
| 32 | Directory indexes:              yes             no | ||
| 33 | Sparse file support:            yes             no | ||
| 34 | Tail-end packing (fragments):   yes             no | ||
| 35 | Exportable (NFS etc.):          yes             no | ||
| 36 | Hard link support:              yes             no | ||
| 37 | "." and ".." in readdir:        yes             no | ||
| 38 | Real inode numbers:             yes             no | ||
| 39 | 32-bit uids/gids:               yes             no | ||
| 40 | File creation time:             yes             no | ||
| 41 | Xattr and ACL support:          no              no | ||
| 42 | |||
| 43 | Squashfs compresses data, inodes and directories. In addition, inode and | ||
| 44 | directory data are highly compacted, and packed on byte boundaries. Each | ||
| 45 | compressed inode is on average 8 bytes in length (the exact length varies with | ||
| 46 | file type, i.e. regular file, directory, symbolic link, and block/char device | ||
| 47 | inodes have different sizes). | ||
| 48 | |||
| 49 | 2. USING SQUASHFS | ||
| 50 | ----------------- | ||
| 51 | |||
| 52 | As squashfs is a read-only filesystem, the mksquashfs program must be used to | ||
| 53 | create populated squashfs filesystems. This and other squashfs utilities | ||
| 54 | can be obtained from http://www.squashfs.org, which also provides usage | ||
| 55 | instructions. | ||
| 56 | |||
| 57 | |||
| 58 | 3. SQUASHFS FILESYSTEM DESIGN | ||
| 59 | ----------------------------- | ||
| 60 | |||
| 61 | A squashfs filesystem consists of seven parts, packed together on a byte | ||
| 62 | alignment: | ||
| 63 | |||
| 64 | --------------- | ||
| 65 | | superblock | | ||
| 66 | |---------------| | ||
| 67 | | datablocks | | ||
| 68 | | & fragments | | ||
| 69 | |---------------| | ||
| 70 | | inode table | | ||
| 71 | |---------------| | ||
| 72 | | directory | | ||
| 73 | | table | | ||
| 74 | |---------------| | ||
| 75 | | fragment | | ||
| 76 | | table | | ||
| 77 | |---------------| | ||
| 78 | | export | | ||
| 79 | | table | | ||
| 80 | |---------------| | ||
| 81 | | uid/gid | | ||
| 82 | | lookup table | | ||
| 83 | --------------- | ||
| 84 | |||
| 85 | Compressed data blocks are written to the filesystem as files are read from | ||
| 86 | the source directory, and checked for duplicates. Once all file data has been | ||
| 87 | written the completed inode, directory, fragment, export and uid/gid lookup | ||
| 88 | tables are written. | ||
| 89 | |||
| 90 | 3.1 Inodes | ||
| 91 | ---------- | ||
| 92 | |||
| 93 | Metadata (inodes and directories) is compressed in 8 KiB blocks. Each | ||
| 94 | block is prefixed by a two-byte length; the top bit is set if the block is | ||
| 95 | uncompressed. A block is stored uncompressed if the mksquashfs -noI option | ||
| 96 | is set, or if the compressed block would be larger than the uncompressed block. | ||
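
The two-byte prefix described above can be decoded as in the following
minimal sketch. The little-endian byte order and the names
parse_metadata_header and UNCOMPRESSED_BIT are assumptions made for
illustration; they are not taken from this document.

```python
import struct

# Top bit of the two-byte length word: set means the block is stored uncompressed.
UNCOMPRESSED_BIT = 0x8000


def parse_metadata_header(raw: bytes):
    """Decode a two-byte metadata block prefix.

    Returns (payload_length, is_compressed). Little-endian byte order is
    an assumption of this sketch.
    """
    if len(raw) < 2:
        raise ValueError("need at least two bytes for the length prefix")
    (word,) = struct.unpack("<H", raw[:2])
    length = word & (UNCOMPRESSED_BIT - 1)      # low 15 bits hold the length
    is_compressed = not (word & UNCOMPRESSED_BIT)
    return length, is_compressed
```

A reader would then fetch payload_length bytes following the prefix and
only run them through the decompressor when is_compressed is true.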
| 97 | |||
| 98 | Inodes are packed into the metadata blocks and are not aligned to block | ||
| 99 | boundaries, so an inode may span two metadata blocks. Inodes are identified | ||
| 100 | by a 48-bit number which encodes the location of the compressed metadata block | ||
| 101 | containing the inode, and the byte offset into that block where the inode is | ||
| 102 | placed (<block, offset>). | ||
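
As an illustration of the <block, offset> encoding, the sketch below packs
and unpacks a 48-bit inode number. The split into a 32-bit block location
and a 16-bit offset is an assumption of this sketch (the text above only
states the 48-bit total), and the function names are hypothetical.

```python
# Assumed layout: upper 32 bits = location of the compressed metadata block,
# lower 16 bits = byte offset of the inode inside the decompressed block.
OFFSET_BITS = 16
BLOCK_BITS = 32


def make_inode_ref(block: int, offset: int) -> int:
    """Pack <block, offset> into a single 48-bit inode number."""
    if not 0 <= block < (1 << BLOCK_BITS):
        raise ValueError("block location does not fit in 32 bits")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in 16 bits")
    return (block << OFFSET_BITS) | offset


def split_inode_ref(ref: int):
    """Unpack a 48-bit inode number into (block, offset)."""
    return ref >> OFFSET_BITS, ref & ((1 << OFFSET_BITS) - 1)
```

Because metadata blocks hold 8 KiB of uncompressed data, any valid offset
fits comfortably in the assumed 16-bit field.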
| 103 | |||
| 104 | To maximise compression there are different inodes for each file type | ||
| 105 | (regular file, directory, device, etc.), the inode contents and length | ||
| 106 | varying with the type. | ||
| 107 | |||
| 108 | To further maximise compression, two types of regular file inode and | ||
| 109 | directory inode are defined: inodes optimised for frequently occurring | ||
| 110 | regular files and directories, and extended types where extra | ||
| 111 | information has to be stored. | ||
| 112 | |||
| 113 | 3.2 Directories | ||
| 114 | --------------- | ||
| 115 | |||
| 116 | Like inodes, directories are packed into compressed metadata blocks, stored | ||
| 117 | in a directory table. Directories are accessed using the start address of | ||
| 118 | the metablock containing the directory and the offset into the | ||
| 119 | decompressed block (<block, offset>). | ||
| 120 | |||
| 121 | Directories are organised in a slightly complex way, and are not simply | ||
| 122 | a list of file names. The organisation takes advantage of the | ||
| 123 | fact that (in most cases) the inodes of the files will be in the same | ||
| 124 | compressed metadata block, and can therefore share the start block. | ||
| 125 | Directories are therefore organised as a two-level list: a directory | ||
| 126 | header containing the shared start block value, followed by a sequence of | ||
| 127 | directory entries, each of which uses that start block. A new directory | ||
| 128 | header is written whenever the inode start block changes. The directory | ||
| 129 | header/directory entry list is repeated as many times as necessary. | ||
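
The two-level organisation can be sketched as a simple grouping pass over
the sorted entries. This is only a model of the idea: the tuple layout and
the name group_directory_entries are hypothetical, and any on-disk field
widths or per-header limits are deliberately left out.

```python
def group_directory_entries(entries):
    """Group directory entries into (header, entries) runs.

    `entries` is an iterable of (name, inode_start_block, inode_offset)
    tuples, already sorted by name. A new header is started whenever the
    inode start block differs from that of the previous entry.
    Returns a list of (start_block, [(name, inode_offset), ...]).
    """
    runs = []
    for name, start_block, offset in entries:
        if not runs or runs[-1][0] != start_block:
            runs.append((start_block, []))   # new directory header
        runs[-1][1].append((name, offset))   # entry shares the header's start block
    return runs
```

In the common case where all inodes of a directory sit in one metadata
block, this yields a single header and the start block is stored only once.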
| 130 | |||
| 131 | Directories are sorted in alphabetical order, and can contain a directory | ||
| 132 | index to speed up file lookup. Directory indexes store one entry per | ||
| 133 | metablock, each entry storing the index/filename mapping to the first | ||
| 134 | directory header in that metadata block. At lookup the index is scanned | ||
| 135 | linearly, looking for the first filename alphabetically larger than the | ||
| 136 | filename being looked up. At this point the location of the metadata block | ||
| 137 | the filename is in has been found. | ||
| 138 | The general idea of the index is to ensure that only one metadata block needs | ||
| 139 | to be decompressed to do a lookup, irrespective of the length of the directory. | ||
| 140 | This scheme has the advantage that it doesn't require extra memory overhead | ||
| 141 | and doesn't require much extra storage on disk. | ||
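
The linear index scan can be modelled as below. The representation of the
index as a list of (first_filename, block_location) pairs, the fallback to
the block where the directory starts, and the name find_directory_block are
assumptions of this sketch. Plain Python string comparison stands in for
"alphabetical" ordering.

```python
def find_directory_block(index, dir_start_block, name):
    """Return the location of the metadata block that may contain `name`.

    `index` is a sorted list of (first_filename, block_location) pairs,
    one per metadata block. The scan stops at the first index filename
    that is larger than `name`; the block recorded just before that point
    is the only one that needs to be decompressed.
    """
    block = dir_start_block          # used if `name` sorts before every index entry
    for first_name, location in index:
        if first_name > name:
            break
        block = location
    return block
```

Only the index is scanned; the directory entries themselves are searched
after decompressing the single block this function returns.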
| 142 | |||
| 143 | 3.3 File data | ||
| 144 | ------------- | ||
| 145 | |||
| 146 | Regular files consist of a sequence of contiguous compressed blocks, and/or a | ||
| 147 | compressed fragment block (tail-end packed block). The compressed size | ||
| 148 | of each datablock is stored in a block list contained within the | ||
| 149 | file inode. | ||
| 150 | |||
| 151 | To speed up access to datablocks when reading 'large' files (256 Mbytes or | ||
| 152 | larger), the code implements an index cache that caches the mapping from | ||
| 153 | block index to datablock location on disk. | ||
| 154 | |||
| 155 | The index cache allows Squashfs to handle large files (up to 1.75 TiB) while | ||
| 156 | retaining a simple and space-efficient block list on disk. The cache | ||
| 157 | is split into slots, caching up to eight 224 GiB files (128 KiB blocks). | ||
| 158 | Larger files use multiple slots, with 1.75 TiB files using all 8 slots. | ||
| 159 | The index cache is designed to be memory efficient, and by default uses | ||
| 160 | 16 KiB. | ||
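
The figures quoted above are mutually consistent, which the short
calculation below checks. It uses only the numbers given in the text (8
slots, 224 GiB per slot, 128 KiB blocks); the helper slots_needed is a
hypothetical name introduced for illustration.

```python
KIB, GIB, TIB = 2**10, 2**30, 2**40

BLOCK_SIZE = 128 * KIB        # default block size
SLOT_COVERAGE = 224 * GIB     # file data indexed by one cache slot
SLOTS = 8

# Number of 128 KiB datablocks whose locations one slot can index.
BLOCKS_PER_SLOT = SLOT_COVERAGE // BLOCK_SIZE

# Largest file the whole index cache can cover.
MAX_INDEXED_BYTES = SLOTS * SLOT_COVERAGE


def slots_needed(file_size: int) -> int:
    """Slots a file of `file_size` bytes would occupy (ceiling division)."""
    return -(-file_size // SLOT_COVERAGE)
```

With these values, eight slots of 224 GiB add up to exactly 1.75 TiB, the
maximum file size quoted for the index cache.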
| 161 | |||
| 162 | 3.4 Fragment lookup table | ||
| 163 | ------------------------- | ||
| 164 | |||
| 165 | Regular files can contain a fragment index which is mapped to a fragment | ||
| 166 | location on disk and compressed size using a fragment lookup table. This | ||
| 167 | fragment lookup table is itself stored compressed into metadata blocks. | ||
| 168 | A second index table is used to locate these. For speed of access (and | ||
| 169 | because it is small), this second index table is read at mount time and | ||
| 170 | cached in memory. | ||
| 171 | |||
| 172 | 3.5 Uid/gid lookup table | ||
| 173 | ------------------------ | ||
| 174 | |||
| 175 | For space efficiency regular files store uid and gid indexes, which are | ||
| 176 | converted to 32-bit uids/gids using an id look up table. This table is | ||
| 177 | stored compressed into metadata blocks. A second index table is used to | ||
| 178 | locate these. For speed of access (and because it is small), this second | ||
| 179 | index table is read at mount time and cached in memory. | ||
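
A two-level lookup of this kind might work as in the sketch below. It
assumes that ids are packed back to back as 4-byte little-endian values
inside 8 KiB metadata blocks, so that one block holds 2048 ids. The packing,
the byte order, and the names locate_id, lookup_id and read_metadata_block
are assumptions for illustration, not a description of the actual code.

```python
METADATA_BLOCK_SIZE = 8 * 1024                   # uncompressed metadata block size
ID_SIZE = 4                                      # one 32-bit uid/gid
IDS_PER_BLOCK = METADATA_BLOCK_SIZE // ID_SIZE   # 2048 under the assumed packing


def locate_id(index: int):
    """Map an id index to (metadata block number, byte offset in that block)."""
    return index // IDS_PER_BLOCK, (index % IDS_PER_BLOCK) * ID_SIZE


def lookup_id(index, index_table, read_metadata_block):
    """Resolve an id index to a 32-bit uid/gid.

    `index_table` is the in-memory second-level table of metadata block
    locations. `read_metadata_block` is a hypothetical callable that
    returns the decompressed bytes of the block at a given location.
    """
    block_no, byte_off = locate_id(index)
    data = read_metadata_block(index_table[block_no])
    return int.from_bytes(data[byte_off:byte_off + ID_SIZE], "little")
```

The same shape applies to the other lookup tables in this section: the
small second-level table stays in memory, and at most one metadata block is
read and decompressed per lookup.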
| 180 | |||
| 181 | 3.6 Export table | ||
| 182 | ---------------- | ||
| 183 | |||
| 184 | To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems | ||
| 185 | can optionally (disabled with the -no-exports Mksquashfs option) contain | ||
| 186 | an inode number to inode disk location lookup table. This is required to | ||
| 187 | enable Squashfs to map inode numbers passed in filehandles to the inode | ||
| 188 | location on disk, which is necessary when the export code reinstantiates | ||
| 189 | expired/flushed inodes. | ||
| 190 | |||
| 191 | This table is stored compressed into metadata blocks. A second index table is | ||
| 192 | used to locate these. For speed of access (and because it is small), this | ||
| 193 | second index table is read at mount time and cached in memory. | ||
| 194 | |||
| 195 | |||
| 196 | 4. TODOS AND OUTSTANDING ISSUES | ||
| 197 | ------------------------------- | ||
| 198 | |||
| 199 | 4.1 Todo list | ||
| 200 | ------------- | ||
| 201 | |||
| 202 | Implement Xattr and ACL support. The Squashfs 4.0 filesystem layout has hooks | ||
| 203 | for these but the code has not been written. Once the code has been written | ||
| 204 | the existing layout should not require modification. | ||
| 205 | |||
| 206 | 4.2 Squashfs internal cache | ||
| 207 | --------------------------- | ||
| 208 | |||
| 209 | Blocks in Squashfs are compressed. To avoid repeatedly decompressing | ||
| 210 | recently accessed data Squashfs uses two small metadata and fragment caches. | ||
| 211 | |||
| 212 | The cache is not used for file datablocks; these are decompressed and cached in | ||
| 213 | the page-cache in the normal way. The cache is used to temporarily cache | ||
| 214 | fragment and metadata blocks which have been read as a result of a metadata | ||
| 215 | (i.e. inode or directory) or fragment access. Because metadata and fragments | ||
| 216 | are packed together into blocks (to gain greater compression), the read of a | ||
| 217 | particular piece of metadata or fragment will retrieve other metadata/fragments | ||
| 218 | which have been packed with it; because of locality of reference, these may be | ||
| 219 | read in the near future. Temporarily caching them ensures they are available | ||
| 220 | for near future access without requiring an additional read and decompress. | ||
| 221 | |||
| 222 | In the future this internal cache may be replaced with an implementation which | ||
| 223 | uses the kernel page cache. Because the page cache operates on page sized | ||
| 224 | units this may introduce additional complexity in terms of locking and | ||
| 225 | associated race conditions. | ||
