2 files changed, 243 insertions, 0 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 875d49696b6e..5139b8c9d5af 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -62,6 +62,8 @@ jfs.txt
        - info and mount options for the JFS filesystem.
 locks.txt
        - info on file locking implementations, flock() vs. fcntl(), etc.
+logfs.txt
+        - info on the LogFS flash filesystem.
 mandatory-locking.txt
        - info on the Linux implementation of Sys V mandatory file locking.
 ncpfs.txt
diff --git a/Documentation/filesystems/logfs.txt b/Documentation/filesystems/logfs.txt
new file mode 100644
index 000000000000..e64c94ba401a
--- /dev/null
+++ b/Documentation/filesystems/logfs.txt
@@ -0,0 +1,241 @@
+The LogFS Flash Filesystem
+==========================
+Specification
+=============
+Superblocks
+-----------
+Two superblocks exist at the beginning and end of the filesystem.
+Each superblock is 256 Bytes large, with another 3840 Bytes reserved
+for future purposes, making a total of 4096 Bytes.
+Superblock locations may differ for MTD and block devices.  On MTD the
+first non-bad block contains a superblock in the first 4096 Bytes and
+the last non-bad block contains a superblock in the last 4096 Bytes.
+On block devices, the first 4096 Bytes of the device contain the first
+superblock and the last aligned 4096 Byte-block contains the second
+superblock.
+For the most part, the superblocks can be considered read-only.  They
+are written only to correct errors detected within the superblocks,
+move the journal and change the filesystem parameters through tunefs.
+As a result, the superblock does not contain any fields that require
+constant updates, like the amount of free space, etc.
+Segments
+--------
+The space in the device is split up into equal-sized segments.
+Segments are the primary write unit of LogFS.  Within each segments,
+writes happen from front (low addresses) to back (high addresses.  If
+only a partial segment has been written, the segment number, the
+current position within and optionally a write buffer are stored in
+the journal.
+Segments are erased as a whole.  Therefore Garbage Collection may be
+required to completely free a segment before doing so.
+Journal
+--------
+The journal contains all global information about the filesystem that
+is subject to frequent change.  At mount time, it has to be scanned
+for the most recent commit entry, which contains a list of pointers to
+all currently valid entries.
+Object Store
+------------
+All space except for the superblocks and journal is part of the object
+store.  Each segment contains a segment header and a number of
+objects, each consisting of the object header and the payload.
+Objects are either inodes, directory entries (dentries), file data
+blocks or indirect blocks.
+Levels
+------
+Garbage collection (GC) may fail if all data is written
+indiscriminately.  One requirement of GC is that data is seperated
+roughly according to the distance between the tree root and the data.
+Effectively that means all file data is on level 0, indirect blocks
+are on levels 1, 2, 3 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks,
+respectively.  Inode file data is on level 6 for the inodes and 7-11
+for indirect blocks.
+Each segment contains objects of a single level only.  As a result,
+each level requires its own seperate segment to be open for writing.
+Inode File
+----------
+All inodes are stored in a special file, the inode file.  Single
+exception is the inode file's inode (master inode) which for obvious
+reasons is stored in the journal instead.  Instead of data blocks, the
+leaf nodes of the inode files are inodes.
+Aliases
+-------
+Writes in LogFS are done by means of a wandering tree.  A naïve
+implementation would require that for each write or a block, all
+parent blocks are written as well, since the block pointers have
+changed.  Such an implementation would not be very efficient.
+In LogFS, the block pointer changes are cached in the journal by means
+of alias entries.  Each alias consists of its logical address - inode
+number, block index, level and child number (index into block) - and
+the changed data.  Any 8-byte word can be changes in this manner.
+Currently aliases are used for block pointers, file size, file used
+bytes and the height of an inodes indirect tree.
+Segment Aliases
+---------------
+Related to regular aliases, these are used to handle bad blocks.
+Initially, bad blocks are handled by moving the affected segment
+content to a spare segment and noting this move in the journal with a
+segment alias, a simple (to, from) tupel.  GC will later empty this
+segment and the alias can be removed again.  This is used on MTD only.
+Vim
+---
+By cleverly predicting the life time of data, it is possible to
+seperate long-living data from short-living data and thereby reduce
+the GC overhead later.  Each type of distinc life expectency (vim) can
+have a seperate segment open for writing.  Each (level, vim) tupel can
+be open just once.  If an open segment with unknown vim is encountered
+at mount time, it is closed and ignored henceforth.
+Indirect Tree
+-------------
+Inodes in LogFS are similar to FFS-style filesystems with direct and
+indirect block pointers.  One difference is that LogFS uses a single
+indirect pointer that can be either a 1x, 2x, etc. indirect pointer.
+A height field in the inode defines the height of the indirect tree
+and thereby the indirection of the pointer.
+Another difference is the addressing of indirect blocks.  In LogFS,
+the first 16 pointers in the first indirect block are left empty,
+corresponding to the 16 direct pointers in the inode.  In ext2 (maybe
+others as well) the first pointer in the first indirect block
+corresponds to logical block 12, skipping the 12 direct pointers.
+So where ext2 is using arithmetic to better utilize space, LogFS keeps
+arithmetic simple and uses compression to save space.
+Compression
+-----------
+Both file data and metadata can be compressed.  Compression for file
+data can be enabled with chattr +c and disabled with chattr -c.  Doing
+so has no effect on existing data, but new data will be stored
+accordingly.  New inodes will inherit the compression flag of the
+parent directory.
+Metadata is always compressed.  However, the space accounting ignores
+this and charges for the uncompressed size.  Failing to do so could
+result in GC failures when, after moving some data, indirect blocks
+compress worse than previously.  Even on a 100% full medium, GC may
+not consume any extra space, so the compression gains are lost space
+to the user.
+However, they are not lost space to the filesystem internals.  By
+cheating the user for those bytes, the filesystem gained some slack
+space and GC will run less often and faster.
+Garbage Collection and Wear Leveling
+------------------------------------
+Garbage collection is invoked whenever the number of free segments
+falls below a threshold.  The best (known) candidate is picked based
+on the least amount of valid data contained in the segment.  All
+remaining valid data is copied elsewhere, thereby invalidating it.
+The GC code also checks for aliases and writes then back if their
+number gets too large.
+Wear leveling is done by occasionally picking a suboptimal segment for
+garbage collection.  If a stale segments erase count is significantly
+lower than the active segments' erase counts, it will be picked.  Wear
+leveling is rate limited, so it will never monopolize the device for
+more than one segment worth at a time.
+Values for "occasionally", "significantly lower" are compile time
+constants.
+Hashed directories
+------------------
+To satisfy efficient lookup(), directory entries are hashed and
+located based on the hash.  In order to both support large directories
+and not be overly inefficient for small directories, several hash
+tables of increasing size are used.  For each table, the hash value
+modulo the table size gives the table index.
+Tables sizes are chosen to limit the number of indirect blocks with a
+fully populated table to 0, 1, 2 or 3 respectively.  So the first
+table contains 16 entries, the second 512-16, etc.
+The last table is special in several ways.  First its size depends on
+the effective 32bit limit on telldir/seekdir cookies.  Since logfs
+uses the upper half of the address space for indirect blocks, the size
+is limited to 2^31.  Secondly the table contains hash buckets with 16
+entries each.
+Using single-entry buckets would result in birthday "attacks".  At
+just 2^16 used entries, hash collisions would be likely (P >= 0.5).
+My math skills are insufficient to do the combinatorics for the 17x
+collisions necessary to overflow a bucket, but testing showed that in
+10,000 runs the lowest directory fill before a bucket overflow was
+188,057,130 entries with an average of 315,149,915 entries.  So for
+directory sizes of up to a million, bucket overflows should be
+virtually impossible under normal circumstances.
+With carefully chosen filenames, it is obviously possible to cause an
+overflow with just 21 entries (4 higher tables + 16 entries + 1).  So
+there may be a security concern if a malicious user has write access
+to a directory.
+Open For Discussion
+===================
+Device Address Space
+--------------------
+A device address space is used for caching.  Both block devices and
+MTD provide functions to either read a single page or write a segment.
+Partial segments may be written for data integrity, but where possible
+complete segments are written for performance on simple block device
+flash media.
+Meta Inodes
+-----------
+Inodes are stored in the inode file, which is just a regular file for
+most purposes.  At umount time, however, the inode file needs to
+remain open until all dirty inodes are written.  So
+generic_shutdown_super() may not close this inode, but shouldn't
+complain about remaining inodes due to the inode file either.  Same
+goes for mapping inode of the device address space.
+Currently logfs uses a hack that essentially copies part of fs/inode.c
+code over.  A general solution would be preferred.
+Indirect block mapping
+----------------------
+With compression, the block device (or mapping inode) cannot be used
+to cache indirect blocks.  Some other place is required.  Currently
+logfs uses the top half of each inode's address space.  The low 8TB
+(on 32bit) are filled with file data, the high 8TB are used for
+indirect blocks.
+One problem is that 16TB files created on 64bit systems actually have
+data in the top 8TB.  But files >16TB would cause problems anyway, so
+only the limit has changed.

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index 875d49696b6e..5139b8c9d5af 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX
@@ -62,6 +62,8 @@ jfs.txt
62	- info and mount options for the JFS filesystem.	62	- info and mount options for the JFS filesystem.
63	locks.txt	63	locks.txt
64	- info on file locking implementations, flock() vs. fcntl(), etc.	64	- info on file locking implementations, flock() vs. fcntl(), etc.
		65	logfs.txt
		66	- info on the LogFS flash filesystem.
65	mandatory-locking.txt	67	mandatory-locking.txt
66	- info on the Linux implementation of Sys V mandatory file locking.	68	- info on the Linux implementation of Sys V mandatory file locking.
67	ncpfs.txt	69	ncpfs.txt


diff --git a/Documentation/filesystems/logfs.txt b/Documentation/filesystems/logfs.txt new file mode 100644 index 000000000000..e64c94ba401a --- /dev/null +++ b/Documentation/filesystems/logfs.txt
@@ -0,0 +1,241 @@
		1
		2	The LogFS Flash Filesystem
		3	==========================
		4
		5	Specification
		6	=============
		7
		8	Superblocks
		9	-----------
		10
		11	Two superblocks exist at the beginning and end of the filesystem.
		12	Each superblock is 256 Bytes large, with another 3840 Bytes reserved
		13	for future purposes, making a total of 4096 Bytes.
		14
		15	Superblock locations may differ for MTD and block devices. On MTD the
		16	first non-bad block contains a superblock in the first 4096 Bytes and
		17	the last non-bad block contains a superblock in the last 4096 Bytes.
		18	On block devices, the first 4096 Bytes of the device contain the first
		19	superblock and the last aligned 4096 Byte-block contains the second
		20	superblock.
		21
		22	For the most part, the superblocks can be considered read-only. They
		23	are written only to correct errors detected within the superblocks,
		24	move the journal and change the filesystem parameters through tunefs.
		25	As a result, the superblock does not contain any fields that require
		26	constant updates, like the amount of free space, etc.
		27
		28	Segments
		29	--------
		30
		31	The space in the device is split up into equal-sized segments.
		32	Segments are the primary write unit of LogFS. Within each segments,
		33	writes happen from front (low addresses) to back (high addresses. If
		34	only a partial segment has been written, the segment number, the
		35	current position within and optionally a write buffer are stored in
		36	the journal.
		37
		38	Segments are erased as a whole. Therefore Garbage Collection may be
		39	required to completely free a segment before doing so.
		40
		41	Journal
		42	--------
		43
		44	The journal contains all global information about the filesystem that
		45	is subject to frequent change. At mount time, it has to be scanned
		46	for the most recent commit entry, which contains a list of pointers to
		47	all currently valid entries.
		48
		49	Object Store
		50	------------
		51
		52	All space except for the superblocks and journal is part of the object
		53	store. Each segment contains a segment header and a number of
		54	objects, each consisting of the object header and the payload.
		55	Objects are either inodes, directory entries (dentries), file data
		56	blocks or indirect blocks.
		57
		58	Levels
		59	------
		60
		61	Garbage collection (GC) may fail if all data is written
		62	indiscriminately. One requirement of GC is that data is seperated
		63	roughly according to the distance between the tree root and the data.
		64	Effectively that means all file data is on level 0, indirect blocks
		65	are on levels 1, 2, 3 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks,
		66	respectively. Inode file data is on level 6 for the inodes and 7-11
		67	for indirect blocks.
		68
		69	Each segment contains objects of a single level only. As a result,
		70	each level requires its own seperate segment to be open for writing.
		71
		72	Inode File
		73	----------
		74
		75	All inodes are stored in a special file, the inode file. Single
		76	exception is the inode file's inode (master inode) which for obvious
		77	reasons is stored in the journal instead. Instead of data blocks, the
		78	leaf nodes of the inode files are inodes.
		79
		80	Aliases
		81	-------
		82
		83	Writes in LogFS are done by means of a wandering tree. A naïve
		84	implementation would require that for each write or a block, all
		85	parent blocks are written as well, since the block pointers have
		86	changed. Such an implementation would not be very efficient.
		87
		88	In LogFS, the block pointer changes are cached in the journal by means
		89	of alias entries. Each alias consists of its logical address - inode
		90	number, block index, level and child number (index into block) - and
		91	the changed data. Any 8-byte word can be changes in this manner.
		92
		93	Currently aliases are used for block pointers, file size, file used
		94	bytes and the height of an inodes indirect tree.
		95
		96	Segment Aliases
		97	---------------
		98
		99	Related to regular aliases, these are used to handle bad blocks.
		100	Initially, bad blocks are handled by moving the affected segment
		101	content to a spare segment and noting this move in the journal with a
		102	segment alias, a simple (to, from) tupel. GC will later empty this
		103	segment and the alias can be removed again. This is used on MTD only.
		104
		105	Vim
		106	---
		107
		108	By cleverly predicting the life time of data, it is possible to
		109	seperate long-living data from short-living data and thereby reduce
		110	the GC overhead later. Each type of distinc life expectency (vim) can
		111	have a seperate segment open for writing. Each (level, vim) tupel can
		112	be open just once. If an open segment with unknown vim is encountered
		113	at mount time, it is closed and ignored henceforth.
		114
		115	Indirect Tree
		116	-------------
		117
		118	Inodes in LogFS are similar to FFS-style filesystems with direct and
		119	indirect block pointers. One difference is that LogFS uses a single
		120	indirect pointer that can be either a 1x, 2x, etc. indirect pointer.
		121	A height field in the inode defines the height of the indirect tree
		122	and thereby the indirection of the pointer.
		123
		124	Another difference is the addressing of indirect blocks. In LogFS,
		125	the first 16 pointers in the first indirect block are left empty,
		126	corresponding to the 16 direct pointers in the inode. In ext2 (maybe
		127	others as well) the first pointer in the first indirect block
		128	corresponds to logical block 12, skipping the 12 direct pointers.
		129	So where ext2 is using arithmetic to better utilize space, LogFS keeps
		130	arithmetic simple and uses compression to save space.
		131
		132	Compression
		133	-----------
		134
		135	Both file data and metadata can be compressed. Compression for file
		136	data can be enabled with chattr +c and disabled with chattr -c. Doing
		137	so has no effect on existing data, but new data will be stored
		138	accordingly. New inodes will inherit the compression flag of the
		139	parent directory.
		140
		141	Metadata is always compressed. However, the space accounting ignores
		142	this and charges for the uncompressed size. Failing to do so could
		143	result in GC failures when, after moving some data, indirect blocks
		144	compress worse than previously. Even on a 100% full medium, GC may
		145	not consume any extra space, so the compression gains are lost space
		146	to the user.
		147
		148	However, they are not lost space to the filesystem internals. By
		149	cheating the user for those bytes, the filesystem gained some slack
		150	space and GC will run less often and faster.
		151
		152	Garbage Collection and Wear Leveling
		153	------------------------------------
		154
		155	Garbage collection is invoked whenever the number of free segments
		156	falls below a threshold. The best (known) candidate is picked based
		157	on the least amount of valid data contained in the segment. All
		158	remaining valid data is copied elsewhere, thereby invalidating it.
		159
		160	The GC code also checks for aliases and writes then back if their
		161	number gets too large.
		162
		163	Wear leveling is done by occasionally picking a suboptimal segment for
		164	garbage collection. If a stale segments erase count is significantly
		165	lower than the active segments' erase counts, it will be picked. Wear
		166	leveling is rate limited, so it will never monopolize the device for
		167	more than one segment worth at a time.
		168
		169	Values for "occasionally", "significantly lower" are compile time
		170	constants.
		171
		172	Hashed directories
		173	------------------
		174
		175	To satisfy efficient lookup(), directory entries are hashed and
		176	located based on the hash. In order to both support large directories
		177	and not be overly inefficient for small directories, several hash
		178	tables of increasing size are used. For each table, the hash value
		179	modulo the table size gives the table index.
		180
		181	Tables sizes are chosen to limit the number of indirect blocks with a
		182	fully populated table to 0, 1, 2 or 3 respectively. So the first
		183	table contains 16 entries, the second 512-16, etc.
		184
		185	The last table is special in several ways. First its size depends on
		186	the effective 32bit limit on telldir/seekdir cookies. Since logfs
		187	uses the upper half of the address space for indirect blocks, the size
		188	is limited to 2^31. Secondly the table contains hash buckets with 16
		189	entries each.
		190
		191	Using single-entry buckets would result in birthday "attacks". At
		192	just 2^16 used entries, hash collisions would be likely (P >= 0.5).
		193	My math skills are insufficient to do the combinatorics for the 17x
		194	collisions necessary to overflow a bucket, but testing showed that in
		195	10,000 runs the lowest directory fill before a bucket overflow was
		196	188,057,130 entries with an average of 315,149,915 entries. So for
		197	directory sizes of up to a million, bucket overflows should be
		198	virtually impossible under normal circumstances.
		199
		200	With carefully chosen filenames, it is obviously possible to cause an
		201	overflow with just 21 entries (4 higher tables + 16 entries + 1). So
		202	there may be a security concern if a malicious user has write access
		203	to a directory.
		204
		205	Open For Discussion
		206	===================
		207
		208	Device Address Space
		209	--------------------
		210
		211	A device address space is used for caching. Both block devices and
		212	MTD provide functions to either read a single page or write a segment.
		213	Partial segments may be written for data integrity, but where possible
		214	complete segments are written for performance on simple block device
		215	flash media.
		216
		217	Meta Inodes
		218	-----------
		219
		220	Inodes are stored in the inode file, which is just a regular file for
		221	most purposes. At umount time, however, the inode file needs to
		222	remain open until all dirty inodes are written. So
		223	generic_shutdown_super() may not close this inode, but shouldn't
		224	complain about remaining inodes due to the inode file either. Same
		225	goes for mapping inode of the device address space.
		226
		227	Currently logfs uses a hack that essentially copies part of fs/inode.c
		228	code over. A general solution would be preferred.
		229
		230	Indirect block mapping
		231	----------------------
		232
		233	With compression, the block device (or mapping inode) cannot be used
		234	to cache indirect blocks. Some other place is required. Currently
		235	logfs uses the top half of each inode's address space. The low 8TB
		236	(on 32bit) are filled with file data, the high 8TB are used for
		237	indirect blocks.
		238
		239	One problem is that 16TB files created on 64bit systems actually have
		240	data in the top 8TB. But files >16TB would cause problems anyway, so
		241	only the limit has changed.