Documentation: lzo: document part of the encoding

Add a complete description of the LZO format as processed by the decompressor. I have not found a public specification of this format hence this analysis, which will be used to better understand the code. Cc: Willem Pinckaers <willem@lekkertech.net> Cc: "Don A. Bailey" <donb@securitymouse.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
author: Willy Tarreau <w@1wt.eu> 2014-09-27 06:31:35 -0400
committer: Greg Kroah-Hartman <gregkh@linuxfoundation.org> 2014-09-28 05:08:00 -0400
commit: d98a0526434d27e261f622cf9d2e0028b5ff1a00 (patch)
tree: 14e08dd8af525f9d0686be31aef2af11ee10f252 /Documentation/lzo.txt
parent: 9e82bf014195d6f0054982c463575cdce24292be (diff)
1 files changed, 164 insertions, 0 deletions
diff --git a/Documentation/lzo.txt b/Documentation/lzo.txt
new file mode 100644
index 000000000000..ea45dd3901e3
--- /dev/null
+++ b/Documentation/lzo.txt
@@ -0,0 +1,164 @@
+LZO stream format as understood by Linux's LZO decompressor
+===========================================================
+Introduction
+  This is not a specification. No specification seems to be publicly available
+  for the LZO stream format. This document describes what input format the LZO
+  decompressor as implemented in the Linux kernel understands. The file subject
+  of this analysis is lib/lzo/lzo1x_decompress_safe.c. No analysis was made on
+  the compressor nor on any other implementations though it seems likely that
+  the format matches the standard one. The purpose of this document is to
+  better understand what the code does in order to propose more efficient fixes
+  for future bug reports.
+Description
+  The stream is composed of a series of instructions, operands, and data. The
+  instructions consist in a few bits representing an opcode, and bits forming
+  the operands for the instruction, whose size and position depend on the
+  opcode and on the number of literals copied by previous instruction. The
+  operands are used to indicate :
+    - a distance when copying data from the dictionary (past output buffer)
+    - a length (number of bytes to copy from dictionary)
+    - the number of literals to copy, which is retained in variable "state"
+      as a piece of information for next instructions.
+  Optionally depending on the opcode and operands, extra data may follow. These
+  extra data can be a complement for the operand (eg: a length or a distance
+  encoded on larger values), or a literal to be copied to the output buffer.
+  The first byte of the block follows a different encoding from other bytes, it
+  seems to be optimized for literal use only, since there is no dictionary yet
+  prior to that byte.
+  Lengths are always encoded on a variable size starting with a small number
+  of bits in the operand. If the number of bits isn't enough to represent the
+  length, up to 255 may be added in increments by consuming more bytes with a
+  rate of at most 255 per extra byte (thus the compression ratio cannot exceed
+  around 255:1). The variable length encoding using #bits is always the same :
+       length = byte & ((1 << #bits) - 1)
+       if (!length) {
+               length = ((1 << #bits) - 1)
+               length += 255*(number of zero bytes)
+               length += first-non-zero-byte
+       }
+       length += constant (generally 2 or 3)
+  For references to the dictionary, distances are relative to the output
+  pointer. Distances are encoded using very few bits belonging to certain
+  ranges, resulting in multiple copy instructions using different encodings.
+  Certain encodings involve one extra byte, others involve two extra bytes
+  forming a little-endian 16-bit quantity (marked LE16 below).
+  After any instruction except the large literal copy, 0, 1, 2 or 3 literals
+  are copied before starting the next instruction. The number of literals that
+  were copied may change the meaning and behaviour of the next instruction. In
+  practice, only one instruction needs to know whether 0, less than 4, or more
+  literals were copied. This is the information stored in the <state> variable
+  in this implementation. This number of immediate literals to be copied is
+  generally encoded in the last two bits of the instruction but may also be
+  taken from the last two bits of an extra operand (eg: distance).
+  End of stream is declared when a block copy of distance 0 is seen. Only one
+  instruction may encode this distance (0001HLLL), it takes one LE16 operand
+  for the distance, thus requiring 3 bytes.
+  IMPORTANT NOTE : in the code some length checks are missing because certain
+  instructions are called under the assumption that a certain number of bytes
+  follow because it has already been garanteed before parsing the instructions.
+  They just have to "refill" this credit if they consume extra bytes. This is
+  an implementation design choice independant on the algorithm or encoding.
+Byte sequences
+  First byte encoding :
+      0..17   : follow regular instruction encoding, see below. It is worth
+                noting that codes 16 and 17 will represent a block copy from
+                the dictionary which is empty, and that they will always be
+                invalid at this place.
+      18..21  : copy 0..3 literals
+                state = (byte - 17) = 0..3  [ copy <state> literals ]
+                skip byte
+      22..255 : copy literal string
+                length = (byte - 17) = 4..238
+                state = 4 [ don't copy extra literals ]
+                skip byte
+  Instruction encoding :
+      0 0 0 0 X X X X  (0..15)
+        Depends on the number of literals copied by the last instruction.
+        If last instruction did not copy any literal (state == 0), this
+        encoding will be a copy of 4 or more literal, and must be interpreted
+        like this :
+           0 0 0 0 L L L L  (0..15)  : copy long literal string
+           length = 3 + (L ?: 15 + (zero_bytes * 255) + non_zero_byte)
+           state = 4  (no extra literals are copied)
+        If last instruction used to copy between 1 to 3 literals (encoded in
+        the instruction's opcode or distance), the instruction is a copy of a
+        2-byte block from the dictionary within a 1kB distance. It is worth
+        noting that this instruction provides little savings since it uses 2
+        bytes to encode a copy of 2 other bytes but it encodes the number of
+        following literals for free. It must be interpreted like this :
+           0 0 0 0 D D S S  (0..15)  : copy 2 bytes from <= 1kB distance
+           length = 2
+           state = S (copy S literals after this block)
+         Always followed by exactly one byte : H H H H H H H H
+           distance = (H << 2) + D + 1
+        If last instruction used to copy 4 or more literals (as detected by
+        state == 4), the instruction becomes a copy of a 3-byte block from the
+        dictionary from a 2..3kB distance, and must be interpreted like this :
+           0 0 0 0 D D S S  (0..15)  : copy 3 bytes from 2..3 kB distance
+           length = 3
+           state = S (copy S literals after this block)
+         Always followed by exactly one byte : H H H H H H H H
+           distance = (H << 2) + D + 2049
+      0 0 0 1 H L L L  (16..31)
+           Copy of a block within 16..48kB distance (preferably less than 10B)
+           length = 2 + (L ?: 7 + (zero_bytes * 255) + non_zero_byte)
+        Always followed by exactly one LE16 :  D D D D D D D D : D D D D D D S S
+           distance = 16384 + (H << 14) + D
+           state = S (copy S literals after this block)
+           End of stream is reached if distance == 16384
+      0 0 1 L L L L L  (32..63)
+           Copy of small block within 16kB distance (preferably less than 34B)
+           length = 2 + (L ?: 31 + (zero_bytes * 255) + non_zero_byte)
+        Always followed by exactly one LE16 :  D D D D D D D D : D D D D D D S S
+           distance = D + 1
+           state = S (copy S literals after this block)
+      0 1 L D D D S S  (64..127)
+           Copy 3-4 bytes from block within 2kB distance
+           state = S (copy S literals after this block)
+           length = 3 + L
+         Always followed by exactly one byte : H H H H H H H H
+           distance = (H << 3) + D + 1
+      1 L L D D D S S  (128..255)
+           Copy 5-8 bytes from block within 2kB distance
+           state = S (copy S literals after this block)
+           length = 5 + L
+         Always followed by exactly one byte : H H H H H H H H
+           distance = (H << 3) + D + 1
+Authors
+  This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an
+  analysis of the decompression code available in Linux 3.16-rc5. The code is
+  tricky, it is possible that this document contains mistakes or that a few
+  corner cases were overlooked. In any case, please report any doubt, fix, or
+  proposed updates to the author(s) so that the document can be updated.
author	Willy Tarreau <w@1wt.eu>	2014-09-27 06:31:35 -0400
committer	Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-09-28 05:08:00 -0400
commit	d98a0526434d27e261f622cf9d2e0028b5ff1a00 (patch)
tree	14e08dd8af525f9d0686be31aef2af11ee10f252 /Documentation/lzo.txt
parent	9e82bf014195d6f0054982c463575cdce24292be (diff)

diff --git a/Documentation/lzo.txt b/Documentation/lzo.txt new file mode 100644 index 000000000000..ea45dd3901e3 --- /dev/null +++ b/Documentation/lzo.txt
@@ -0,0 +1,164 @@
	1
	2	LZO stream format as understood by Linux's LZO decompressor
	3	===========================================================
	4
	5	Introduction
	6
	7	This is not a specification. No specification seems to be publicly available
	8	for the LZO stream format. This document describes what input format the LZO
	9	decompressor as implemented in the Linux kernel understands. The file subject
	10	of this analysis is lib/lzo/lzo1x_decompress_safe.c. No analysis was made on
	11	the compressor nor on any other implementations though it seems likely that
	12	the format matches the standard one. The purpose of this document is to
	13	better understand what the code does in order to propose more efficient fixes
	14	for future bug reports.
	15
	16	Description
	17
	18	The stream is composed of a series of instructions, operands, and data. The
	19	instructions consist in a few bits representing an opcode, and bits forming
	20	the operands for the instruction, whose size and position depend on the
	21	opcode and on the number of literals copied by previous instruction. The
	22	operands are used to indicate :
	23
	24	- a distance when copying data from the dictionary (past output buffer)
	25	- a length (number of bytes to copy from dictionary)
	26	- the number of literals to copy, which is retained in variable "state"
	27	as a piece of information for next instructions.
	28
	29	Optionally depending on the opcode and operands, extra data may follow. These
	30	extra data can be a complement for the operand (eg: a length or a distance
	31	encoded on larger values), or a literal to be copied to the output buffer.
	32
	33	The first byte of the block follows a different encoding from other bytes, it
	34	seems to be optimized for literal use only, since there is no dictionary yet
	35	prior to that byte.
	36
	37	Lengths are always encoded on a variable size starting with a small number
	38	of bits in the operand. If the number of bits isn't enough to represent the
	39	length, up to 255 may be added in increments by consuming more bytes with a
	40	rate of at most 255 per extra byte (thus the compression ratio cannot exceed
	41	around 255:1). The variable length encoding using #bits is always the same :
	42
	43	length = byte & ((1 << #bits) - 1)
	44	if (!length) {
	45	length = ((1 << #bits) - 1)
	46	length += 255*(number of zero bytes)
	47	length += first-non-zero-byte
	48	}
	49	length += constant (generally 2 or 3)
	50
	51	For references to the dictionary, distances are relative to the output
	52	pointer. Distances are encoded using very few bits belonging to certain
	53	ranges, resulting in multiple copy instructions using different encodings.
	54	Certain encodings involve one extra byte, others involve two extra bytes
	55	forming a little-endian 16-bit quantity (marked LE16 below).
	56
	57	After any instruction except the large literal copy, 0, 1, 2 or 3 literals
	58	are copied before starting the next instruction. The number of literals that
	59	were copied may change the meaning and behaviour of the next instruction. In
	60	practice, only one instruction needs to know whether 0, less than 4, or more
	61	literals were copied. This is the information stored in the <state> variable
	62	in this implementation. This number of immediate literals to be copied is
	63	generally encoded in the last two bits of the instruction but may also be
	64	taken from the last two bits of an extra operand (eg: distance).
	65
	66	End of stream is declared when a block copy of distance 0 is seen. Only one
	67	instruction may encode this distance (0001HLLL), it takes one LE16 operand
	68	for the distance, thus requiring 3 bytes.
	69
	70	IMPORTANT NOTE : in the code some length checks are missing because certain
	71	instructions are called under the assumption that a certain number of bytes
	72	follow because it has already been garanteed before parsing the instructions.
	73	They just have to "refill" this credit if they consume extra bytes. This is
	74	an implementation design choice independant on the algorithm or encoding.
	75
	76	Byte sequences
	77
	78	First byte encoding :
	79
	80	0..17 : follow regular instruction encoding, see below. It is worth
	81	noting that codes 16 and 17 will represent a block copy from
	82	the dictionary which is empty, and that they will always be
	83	invalid at this place.
	84
	85	18..21 : copy 0..3 literals
	86	state = (byte - 17) = 0..3 [ copy <state> literals ]
	87	skip byte
	88
	89	22..255 : copy literal string
	90	length = (byte - 17) = 4..238
	91	state = 4 [ don't copy extra literals ]
	92	skip byte
	93
	94	Instruction encoding :
	95
	96	0 0 0 0 X X X X (0..15)
	97	Depends on the number of literals copied by the last instruction.
	98	If last instruction did not copy any literal (state == 0), this
	99	encoding will be a copy of 4 or more literal, and must be interpreted
	100	like this :
	101
	102	0 0 0 0 L L L L (0..15) : copy long literal string
	103	length = 3 + (L ?: 15 + (zero_bytes * 255) + non_zero_byte)
	104	state = 4 (no extra literals are copied)
	105
	106	If last instruction used to copy between 1 to 3 literals (encoded in
	107	the instruction's opcode or distance), the instruction is a copy of a
	108	2-byte block from the dictionary within a 1kB distance. It is worth
	109	noting that this instruction provides little savings since it uses 2
	110	bytes to encode a copy of 2 other bytes but it encodes the number of
	111	following literals for free. It must be interpreted like this :
	112
	113	0 0 0 0 D D S S (0..15) : copy 2 bytes from <= 1kB distance
	114	length = 2
	115	state = S (copy S literals after this block)
	116	Always followed by exactly one byte : H H H H H H H H
	117	distance = (H << 2) + D + 1
	118
	119	If last instruction used to copy 4 or more literals (as detected by
	120	state == 4), the instruction becomes a copy of a 3-byte block from the
	121	dictionary from a 2..3kB distance, and must be interpreted like this :
	122
	123	0 0 0 0 D D S S (0..15) : copy 3 bytes from 2..3 kB distance
	124	length = 3
	125	state = S (copy S literals after this block)
	126	Always followed by exactly one byte : H H H H H H H H
	127	distance = (H << 2) + D + 2049
	128
	129	0 0 0 1 H L L L (16..31)
	130	Copy of a block within 16..48kB distance (preferably less than 10B)
	131	length = 2 + (L ?: 7 + (zero_bytes * 255) + non_zero_byte)
	132	Always followed by exactly one LE16 : D D D D D D D D : D D D D D D S S
	133	distance = 16384 + (H << 14) + D
	134	state = S (copy S literals after this block)
	135	End of stream is reached if distance == 16384
	136
	137	0 0 1 L L L L L (32..63)
	138	Copy of small block within 16kB distance (preferably less than 34B)
	139	length = 2 + (L ?: 31 + (zero_bytes * 255) + non_zero_byte)
	140	Always followed by exactly one LE16 : D D D D D D D D : D D D D D D S S
	141	distance = D + 1
	142	state = S (copy S literals after this block)
	143
	144	0 1 L D D D S S (64..127)
	145	Copy 3-4 bytes from block within 2kB distance
	146	state = S (copy S literals after this block)
	147	length = 3 + L
	148	Always followed by exactly one byte : H H H H H H H H
	149	distance = (H << 3) + D + 1
	150
	151	1 L L D D D S S (128..255)
	152	Copy 5-8 bytes from block within 2kB distance
	153	state = S (copy S literals after this block)
	154	length = 5 + L
	155	Always followed by exactly one byte : H H H H H H H H
	156	distance = (H << 3) + D + 1
	157
	158	Authors
	159
	160	This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an
	161	analysis of the decompression code available in Linux 3.16-rc5. The code is
	162	tricky, it is possible that this document contains mistakes or that a few
	163	corner cases were overlooked. In any case, please report any doubt, fix, or
	164	proposed updates to the author(s) so that the document can be updated.