crc32: move long comment about crc32 fundamentals to Documentation/

Move a long comment from lib/crc32.c to Documentation/crc32.txt where it will more likely get read. Edited the resulting document to add an explanation of the slicing-by-n algorithm. [djwong@us.ibm.com: minor changelog tweaks] [akpm@linux-foundation.org: fix typo, per George] Signed-off-by: George Spelvin <linux@horizon.com> Signed-off-by: Bob Pearson <rpearson@systemfabricworks.com> Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: Bob Pearson <rpearson@systemfabricworks.com> 2012-03-23 18:02:22 -0400
committer: Linus Torvalds <torvalds@linux-foundation.org> 2012-03-23 19:58:37 -0400
commit: fbedceb10066430b925cf43fbf926e8abb9e2359 (patch)
tree: ea4f9453fd810c82c106df1e5b5932894ddcadd5 /Documentation
parent: e30c7a8fcf2d5bba53ea07047b1a0f9161da1078 (diff)
2 files changed, 184 insertions, 0 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index a1a643272883..2214f123a976 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -104,6 +104,8 @@ cpuidle/
        - info on CPU_IDLE, CPU idle state management subsystem.
 cputopology.txt
        - documentation on how CPU topology info is exported via sysfs.
+crc32.txt
+        - brief tutorial on CRC computation
 cris/
        - directory with info about Linux on CRIS architecture.
 crypto/
diff --git a/Documentation/crc32.txt b/Documentation/crc32.txt
new file mode 100644
index 000000000000..a08a7dd9d625
--- /dev/null
+++ b/Documentation/crc32.txt
@@ -0,0 +1,182 @@
+A brief CRC tutorial.
+A CRC is a long-division remainder.  You add the CRC to the message,
+and the whole thing (message+CRC) is a multiple of the given
+CRC polynomial.  To check the CRC, you can either check that the
+CRC matches the recomputed value, *or* you can check that the
+remainder computed on the message+CRC is 0.  This latter approach
+is used by a lot of hardware implementations, and is why so many
+protocols put the end-of-frame flag after the CRC.
+It's actually the same long division you learned in school, except that
+- We're working in binary, so the digits are only 0 and 1, and
+- When dividing polynomials, there are no carries.  Rather than add and
+  subtract, we just xor.  Thus, we tend to get a bit sloppy about
+  the difference between adding and subtracting.
+Like all division, the remainder is always smaller than the divisor.
+To produce a 32-bit CRC, the divisor is actually a 33-bit CRC polynomial.
+Since it's 33 bits long, bit 32 is always going to be set, so usually the
+CRC is written in hex with the most significant bit omitted.  (If you're
+familiar with the IEEE 754 floating-point format, it's the same idea.)
+Note that a CRC is computed over a string of *bits*, so you have
+to decide on the endianness of the bits within each byte.  To get
+the best error-detecting properties, this should correspond to the
+order they're actually sent.  For example, standard RS-232 serial is
+little-endian; the most significant bit (sometimes used for parity)
+is sent last.  And when appending a CRC word to a message, you should
+do it in the right order, matching the endianness.
+Just like with ordinary division, you proceed one digit (bit) at a time.
+Each step of the division you take one more digit (bit) of the dividend
+and append it to the current remainder.  Then you figure out the
+appropriate multiple of the divisor to subtract to being the remainder
+back into range.  In binary, this is easy - it has to be either 0 or 1,
+and to make the XOR cancel, it's just a copy of bit 32 of the remainder.
+When computing a CRC, we don't care about the quotient, so we can
+throw the quotient bit away, but subtract the appropriate multiple of
+the polynomial from the remainder and we're back to where we started,
+ready to process the next bit.
+A big-endian CRC written this way would be coded like:
+for (i = 0; i < input_bits; i++) {
+        multiple = remainder & 0x80000000 ? CRCPOLY : 0;
+        remainder = (remainder << 1 | next_input_bit()) ^ multiple;
+}
+Notice how, to get at bit 32 of the shifted remainder, we look
+at bit 31 of the remainder *before* shifting it.
+But also notice how the next_input_bit() bits we're shifting into
+the remainder don't actually affect any decision-making until
+32 bits later.  Thus, the first 32 cycles of this are pretty boring.
+Also, to add the CRC to a message, we need a 32-bit-long hole for it at
+the end, so we have to add 32 extra cycles shifting in zeros at the
+end of every message,
+These details lead to a standard trick: rearrange merging in the
+next_input_bit() until the moment it's needed.  Then the first 32 cycles
+can be precomputed, and merging in the final 32 zero bits to make room
+for the CRC can be skipped entirely.  This changes the code to:
+for (i = 0; i < input_bits; i++) {
+        remainder ^= next_input_bit() << 31;
+        multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
+        remainder = (remainder << 1) ^ multiple;
+}
+With this optimization, the little-endian code is particularly simple:
+for (i = 0; i < input_bits; i++) {
+        remainder ^= next_input_bit();
+        multiple = (remainder & 1) ? CRCPOLY : 0;
+        remainder = (remainder >> 1) ^ multiple;
+}
+The most significant coefficient of the remainder polynomial is stored
+in the least significant bit of the binary "remainder" variable.
+The other details of endianness have been hidden in CRCPOLY (which must
+be bit-reversed) and next_input_bit().
+As long as next_input_bit is returning the bits in a sensible order, we don't
+*have* to wait until the last possible moment to merge in additional bits.
+We can do it 8 bits at a time rather than 1 bit at a time:
+for (i = 0; i < input_bytes; i++) {
+        remainder ^= next_input_byte() << 24;
+        for (j = 0; j < 8; j++) {
+                multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
+                remainder = (remainder << 1) ^ multiple;
+        }
+}
+Or in little-endian:
+for (i = 0; i < input_bytes; i++) {
+        remainder ^= next_input_byte();
+        for (j = 0; j < 8; j++) {
+                multiple = (remainder & 1) ? CRCPOLY : 0;
+                remainder = (remainder >> 1) ^ multiple;
+        }
+}
+If the input is a multiple of 32 bits, you can even XOR in a 32-bit
+word at a time and increase the inner loop count to 32.
+You can also mix and match the two loop styles, for example doing the
+bulk of a message byte-at-a-time and adding bit-at-a-time processing
+for any fractional bytes at the end.
+To reduce the number of conditional branches, software commonly uses
+the byte-at-a-time table method, popularized by Dilip V. Sarwate,
+"Computation of Cyclic Redundancy Checks via Table Look-Up", Comm. ACM
+v.31 no.8 (August 1998) p. 1008-1013.
+Here, rather than just shifting one bit of the remainder to decide
+in the correct multiple to subtract, we can shift a byte at a time.
+This produces a 40-bit (rather than a 33-bit) intermediate remainder,
+and the correct multiple of the polynomial to subtract is found using
+a 256-entry lookup table indexed by the high 8 bits.
+(The table entries are simply the CRC-32 of the given one-byte messages.)
+When space is more constrained, smaller tables can be used, e.g. two
+4-bit shifts followed by a lookup in a 16-entry table.
+It is not practical to process much more than 8 bits at a time using this
+technique, because tables larger than 256 entries use too much memory and,
+more importantly, too much of the L1 cache.
+To get higher software performance, a "slicing" technique can be used.
+See "High Octane CRC Generation with the Intel Slicing-by-8 Algorithm",
+ftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf
+This does not change the number of table lookups, but does increase
+the parallelism.  With the classic Sarwate algorithm, each table lookup
+must be completed before the index of the next can be computed.
+A "slicing by 2" technique would shift the remainder 16 bits at a time,
+producing a 48-bit intermediate remainder.  Rather than doing a single
+lookup in a 65536-entry table, the two high bytes are looked up in
+two different 256-entry tables.  Each contains the remainder required
+to cancel out the corresponding byte.  The tables are different because the
+polynomials to cancel are different.  One has non-zero coefficients from
+x^32 to x^39, while the other goes from x^40 to x^47.
+Since modern processors can handle many parallel memory operations, this
+takes barely longer than a single table look-up and thus performs almost
+twice as fast as the basic Sarwate algorithm.
+This can be extended to "slicing by 4" using 4 256-entry tables.
+Each step, 32 bits of data is fetched, XORed with the CRC, and the result
+broken into bytes and looked up in the tables.  Because the 32-bit shift
+leaves the low-order bits of the intermediate remainder zero, the
+final CRC is simply the XOR of the 4 table look-ups.
+But this still enforces sequential execution: a second group of table
+look-ups cannot begin until the previous groups 4 table look-ups have all
+been completed.  Thus, the processor's load/store unit is sometimes idle.
+To make maximum use of the processor, "slicing by 8" performs 8 look-ups
+in parallel.  Each step, the 32-bit CRC is shifted 64 bits and XORed
+with 64 bits of input data.  What is important to note is that 4 of
+those 8 bytes are simply copies of the input data; they do not depend
+on the previous CRC at all.  Thus, those 4 table look-ups may commence
+immediately, without waiting for the previous loop iteration.
+By always having 4 loads in flight, a modern superscalar processor can
+be kept busy and make full use of its L1 cache.
+Two more details about CRC implementation in the real world:
+Normally, appending zero bits to a message which is already a multiple
+of a polynomial produces a larger multiple of that polynomial.  Thus,
+a basic CRC will not detect appended zero bits (or bytes).  To enable
+a CRC to detect this condition, it's common to invert the CRC before
+appending it.  This makes the remainder of the message+crc come out not
+as zero, but some fixed non-zero value.  (The CRC of the inversion
+pattern, 0xffffffff.)
+The same problem applies to zero bits prepended to the message, and a
+similar solution is used.  Instead of starting the CRC computation with
+a remainder of 0, an initial remainder of all ones is used.  As long as
+you start the same way on decoding, it doesn't make a difference.
author	Bob Pearson <rpearson@systemfabricworks.com>	2012-03-23 18:02:22 -0400
committer	Linus Torvalds <torvalds@linux-foundation.org>	2012-03-23 19:58:37 -0400
commit	fbedceb10066430b925cf43fbf926e8abb9e2359 (patch)
tree	ea4f9453fd810c82c106df1e5b5932894ddcadd5 /Documentation
parent	e30c7a8fcf2d5bba53ea07047b1a0f9161da1078 (diff)

diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index a1a643272883..2214f123a976 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX
@@ -104,6 +104,8 @@ cpuidle/
104	- info on CPU_IDLE, CPU idle state management subsystem.	104	- info on CPU_IDLE, CPU idle state management subsystem.
105	cputopology.txt	105	cputopology.txt
106	- documentation on how CPU topology info is exported via sysfs.	106	- documentation on how CPU topology info is exported via sysfs.
		107	crc32.txt
		108	- brief tutorial on CRC computation
107	cris/	109	cris/
108	- directory with info about Linux on CRIS architecture.	110	- directory with info about Linux on CRIS architecture.
109	crypto/	111	crypto/


diff --git a/Documentation/crc32.txt b/Documentation/crc32.txt new file mode 100644 index 000000000000..a08a7dd9d625 --- /dev/null +++ b/Documentation/crc32.txt
@@ -0,0 +1,182 @@
		1	A brief CRC tutorial.
		2
		3	A CRC is a long-division remainder. You add the CRC to the message,
		4	and the whole thing (message+CRC) is a multiple of the given
		5	CRC polynomial. To check the CRC, you can either check that the
		6	CRC matches the recomputed value, or you can check that the
		7	remainder computed on the message+CRC is 0. This latter approach
		8	is used by a lot of hardware implementations, and is why so many
		9	protocols put the end-of-frame flag after the CRC.
		10
		11	It's actually the same long division you learned in school, except that
		12	- We're working in binary, so the digits are only 0 and 1, and
		13	- When dividing polynomials, there are no carries. Rather than add and
		14	subtract, we just xor. Thus, we tend to get a bit sloppy about
		15	the difference between adding and subtracting.
		16
		17	Like all division, the remainder is always smaller than the divisor.
		18	To produce a 32-bit CRC, the divisor is actually a 33-bit CRC polynomial.
		19	Since it's 33 bits long, bit 32 is always going to be set, so usually the
		20	CRC is written in hex with the most significant bit omitted. (If you're
		21	familiar with the IEEE 754 floating-point format, it's the same idea.)
		22
		23	Note that a CRC is computed over a string of bits, so you have
		24	to decide on the endianness of the bits within each byte. To get
		25	the best error-detecting properties, this should correspond to the
		26	order they're actually sent. For example, standard RS-232 serial is
		27	little-endian; the most significant bit (sometimes used for parity)
		28	is sent last. And when appending a CRC word to a message, you should
		29	do it in the right order, matching the endianness.
		30
		31	Just like with ordinary division, you proceed one digit (bit) at a time.
		32	Each step of the division you take one more digit (bit) of the dividend
		33	and append it to the current remainder. Then you figure out the
		34	appropriate multiple of the divisor to subtract to being the remainder
		35	back into range. In binary, this is easy - it has to be either 0 or 1,
		36	and to make the XOR cancel, it's just a copy of bit 32 of the remainder.
		37
		38	When computing a CRC, we don't care about the quotient, so we can
		39	throw the quotient bit away, but subtract the appropriate multiple of
		40	the polynomial from the remainder and we're back to where we started,
		41	ready to process the next bit.
		42
		43	A big-endian CRC written this way would be coded like:
		44	for (i = 0; i < input_bits; i++) {
		45	multiple = remainder & 0x80000000 ? CRCPOLY : 0;
		46	remainder = (remainder << 1 \| next_input_bit()) ^ multiple;
		47	}
		48
		49	Notice how, to get at bit 32 of the shifted remainder, we look
		50	at bit 31 of the remainder before shifting it.
		51
		52	But also notice how the next_input_bit() bits we're shifting into
		53	the remainder don't actually affect any decision-making until
		54	32 bits later. Thus, the first 32 cycles of this are pretty boring.
		55	Also, to add the CRC to a message, we need a 32-bit-long hole for it at
		56	the end, so we have to add 32 extra cycles shifting in zeros at the
		57	end of every message,
		58
		59	These details lead to a standard trick: rearrange merging in the
		60	next_input_bit() until the moment it's needed. Then the first 32 cycles
		61	can be precomputed, and merging in the final 32 zero bits to make room
		62	for the CRC can be skipped entirely. This changes the code to:
		63
		64	for (i = 0; i < input_bits; i++) {
		65	remainder ^= next_input_bit() << 31;
		66	multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
		67	remainder = (remainder << 1) ^ multiple;
		68	}
		69
		70	With this optimization, the little-endian code is particularly simple:
		71	for (i = 0; i < input_bits; i++) {
		72	remainder ^= next_input_bit();
		73	multiple = (remainder & 1) ? CRCPOLY : 0;
		74	remainder = (remainder >> 1) ^ multiple;
		75	}
		76
		77	The most significant coefficient of the remainder polynomial is stored
		78	in the least significant bit of the binary "remainder" variable.
		79	The other details of endianness have been hidden in CRCPOLY (which must
		80	be bit-reversed) and next_input_bit().
		81
		82	As long as next_input_bit is returning the bits in a sensible order, we don't
		83	have to wait until the last possible moment to merge in additional bits.
		84	We can do it 8 bits at a time rather than 1 bit at a time:
		85	for (i = 0; i < input_bytes; i++) {
		86	remainder ^= next_input_byte() << 24;
		87	for (j = 0; j < 8; j++) {
		88	multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
		89	remainder = (remainder << 1) ^ multiple;
		90	}
		91	}
		92
		93	Or in little-endian:
		94	for (i = 0; i < input_bytes; i++) {
		95	remainder ^= next_input_byte();
		96	for (j = 0; j < 8; j++) {
		97	multiple = (remainder & 1) ? CRCPOLY : 0;
		98	remainder = (remainder >> 1) ^ multiple;
		99	}
		100	}
		101
		102	If the input is a multiple of 32 bits, you can even XOR in a 32-bit
		103	word at a time and increase the inner loop count to 32.
		104
		105	You can also mix and match the two loop styles, for example doing the
		106	bulk of a message byte-at-a-time and adding bit-at-a-time processing
		107	for any fractional bytes at the end.
		108
		109	To reduce the number of conditional branches, software commonly uses
		110	the byte-at-a-time table method, popularized by Dilip V. Sarwate,
		111	"Computation of Cyclic Redundancy Checks via Table Look-Up", Comm. ACM
		112	v.31 no.8 (August 1998) p. 1008-1013.
		113
		114	Here, rather than just shifting one bit of the remainder to decide
		115	in the correct multiple to subtract, we can shift a byte at a time.
		116	This produces a 40-bit (rather than a 33-bit) intermediate remainder,
		117	and the correct multiple of the polynomial to subtract is found using
		118	a 256-entry lookup table indexed by the high 8 bits.
		119
		120	(The table entries are simply the CRC-32 of the given one-byte messages.)
		121
		122	When space is more constrained, smaller tables can be used, e.g. two
		123	4-bit shifts followed by a lookup in a 16-entry table.
		124
		125	It is not practical to process much more than 8 bits at a time using this
		126	technique, because tables larger than 256 entries use too much memory and,
		127	more importantly, too much of the L1 cache.
		128
		129	To get higher software performance, a "slicing" technique can be used.
		130	See "High Octane CRC Generation with the Intel Slicing-by-8 Algorithm",
		131	ftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf
		132
		133	This does not change the number of table lookups, but does increase
		134	the parallelism. With the classic Sarwate algorithm, each table lookup
		135	must be completed before the index of the next can be computed.
		136
		137	A "slicing by 2" technique would shift the remainder 16 bits at a time,
		138	producing a 48-bit intermediate remainder. Rather than doing a single
		139	lookup in a 65536-entry table, the two high bytes are looked up in
		140	two different 256-entry tables. Each contains the remainder required
		141	to cancel out the corresponding byte. The tables are different because the
		142	polynomials to cancel are different. One has non-zero coefficients from
		143	x^32 to x^39, while the other goes from x^40 to x^47.
		144
		145	Since modern processors can handle many parallel memory operations, this
		146	takes barely longer than a single table look-up and thus performs almost
		147	twice as fast as the basic Sarwate algorithm.
		148
		149	This can be extended to "slicing by 4" using 4 256-entry tables.
		150	Each step, 32 bits of data is fetched, XORed with the CRC, and the result
		151	broken into bytes and looked up in the tables. Because the 32-bit shift
		152	leaves the low-order bits of the intermediate remainder zero, the
		153	final CRC is simply the XOR of the 4 table look-ups.
		154
		155	But this still enforces sequential execution: a second group of table
		156	look-ups cannot begin until the previous groups 4 table look-ups have all
		157	been completed. Thus, the processor's load/store unit is sometimes idle.
		158
		159	To make maximum use of the processor, "slicing by 8" performs 8 look-ups
		160	in parallel. Each step, the 32-bit CRC is shifted 64 bits and XORed
		161	with 64 bits of input data. What is important to note is that 4 of
		162	those 8 bytes are simply copies of the input data; they do not depend
		163	on the previous CRC at all. Thus, those 4 table look-ups may commence
		164	immediately, without waiting for the previous loop iteration.
		165
		166	By always having 4 loads in flight, a modern superscalar processor can
		167	be kept busy and make full use of its L1 cache.
		168
		169	Two more details about CRC implementation in the real world:
		170
		171	Normally, appending zero bits to a message which is already a multiple
		172	of a polynomial produces a larger multiple of that polynomial. Thus,
		173	a basic CRC will not detect appended zero bits (or bytes). To enable
		174	a CRC to detect this condition, it's common to invert the CRC before
		175	appending it. This makes the remainder of the message+crc come out not
		176	as zero, but some fixed non-zero value. (The CRC of the inversion
		177	pattern, 0xffffffff.)
		178
		179	The same problem applies to zero bits prepended to the message, and a
		180	similar solution is used. Instead of starting the CRC computation with
		181	a remainder of 0, an initial remainder of all ones is used. As long as
		182	you start the same way on decoding, it doesn't make a difference.