aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorBob Pearson <rpearson@systemfabricworks.com>2012-03-23 18:02:22 -0400
committerLinus Torvalds <torvalds@linux-foundation.org>2012-03-23 19:58:37 -0400
commitfbedceb10066430b925cf43fbf926e8abb9e2359 (patch)
treeea4f9453fd810c82c106df1e5b5932894ddcadd5
parente30c7a8fcf2d5bba53ea07047b1a0f9161da1078 (diff)
crc32: move long comment about crc32 fundamentals to Documentation/
Move a long comment from lib/crc32.c to Documentation/crc32.txt where it will more likely get read. Edited the resulting document to add an explanation of the slicing-by-n algorithm. [djwong@us.ibm.com: minor changelog tweaks] [akpm@linux-foundation.org: fix typo, per George] Signed-off-by: George Spelvin <linux@horizon.com> Signed-off-by: Bob Pearson <rpearson@systemfabricworks.com> Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-rw-r--r--Documentation/00-INDEX2
-rw-r--r--Documentation/crc32.txt182
-rw-r--r--lib/crc32.c129
3 files changed, 186 insertions, 127 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index a1a64327288..2214f123a97 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -104,6 +104,8 @@ cpuidle/
104 - info on CPU_IDLE, CPU idle state management subsystem. 104 - info on CPU_IDLE, CPU idle state management subsystem.
105cputopology.txt 105cputopology.txt
106 - documentation on how CPU topology info is exported via sysfs. 106 - documentation on how CPU topology info is exported via sysfs.
107crc32.txt
108 - brief tutorial on CRC computation
107cris/ 109cris/
108 - directory with info about Linux on CRIS architecture. 110 - directory with info about Linux on CRIS architecture.
109crypto/ 111crypto/
diff --git a/Documentation/crc32.txt b/Documentation/crc32.txt
new file mode 100644
index 00000000000..a08a7dd9d62
--- /dev/null
+++ b/Documentation/crc32.txt
@@ -0,0 +1,182 @@
1A brief CRC tutorial.
2
3A CRC is a long-division remainder. You add the CRC to the message,
4and the whole thing (message+CRC) is a multiple of the given
5CRC polynomial. To check the CRC, you can either check that the
6CRC matches the recomputed value, *or* you can check that the
7remainder computed on the message+CRC is 0. This latter approach
8is used by a lot of hardware implementations, and is why so many
9protocols put the end-of-frame flag after the CRC.
10
11It's actually the same long division you learned in school, except that
12- We're working in binary, so the digits are only 0 and 1, and
13- When dividing polynomials, there are no carries. Rather than add and
14 subtract, we just xor. Thus, we tend to get a bit sloppy about
15 the difference between adding and subtracting.
16
17Like all division, the remainder is always smaller than the divisor.
18To produce a 32-bit CRC, the divisor is actually a 33-bit CRC polynomial.
19Since it's 33 bits long, bit 32 is always going to be set, so usually the
20CRC is written in hex with the most significant bit omitted. (If you're
21familiar with the IEEE 754 floating-point format, it's the same idea.)
22
23Note that a CRC is computed over a string of *bits*, so you have
24to decide on the endianness of the bits within each byte. To get
25the best error-detecting properties, this should correspond to the
26order they're actually sent. For example, standard RS-232 serial is
27little-endian; the most significant bit (sometimes used for parity)
28is sent last. And when appending a CRC word to a message, you should
29do it in the right order, matching the endianness.
30
31Just like with ordinary division, you proceed one digit (bit) at a time.
32Each step of the division you take one more digit (bit) of the dividend
33and append it to the current remainder. Then you figure out the
34appropriate multiple of the divisor to subtract to being the remainder
35back into range. In binary, this is easy - it has to be either 0 or 1,
36and to make the XOR cancel, it's just a copy of bit 32 of the remainder.
37
38When computing a CRC, we don't care about the quotient, so we can
39throw the quotient bit away, but subtract the appropriate multiple of
40the polynomial from the remainder and we're back to where we started,
41ready to process the next bit.
42
43A big-endian CRC written this way would be coded like:
44for (i = 0; i < input_bits; i++) {
45 multiple = remainder & 0x80000000 ? CRCPOLY : 0;
46 remainder = (remainder << 1 | next_input_bit()) ^ multiple;
47}
48
49Notice how, to get at bit 32 of the shifted remainder, we look
50at bit 31 of the remainder *before* shifting it.
51
52But also notice how the next_input_bit() bits we're shifting into
53the remainder don't actually affect any decision-making until
5432 bits later. Thus, the first 32 cycles of this are pretty boring.
55Also, to add the CRC to a message, we need a 32-bit-long hole for it at
56the end, so we have to add 32 extra cycles shifting in zeros at the
57end of every message,
58
59These details lead to a standard trick: rearrange merging in the
60next_input_bit() until the moment it's needed. Then the first 32 cycles
61can be precomputed, and merging in the final 32 zero bits to make room
62for the CRC can be skipped entirely. This changes the code to:
63
64for (i = 0; i < input_bits; i++) {
65 remainder ^= next_input_bit() << 31;
66 multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
67 remainder = (remainder << 1) ^ multiple;
68}
69
70With this optimization, the little-endian code is particularly simple:
71for (i = 0; i < input_bits; i++) {
72 remainder ^= next_input_bit();
73 multiple = (remainder & 1) ? CRCPOLY : 0;
74 remainder = (remainder >> 1) ^ multiple;
75}
76
77The most significant coefficient of the remainder polynomial is stored
78in the least significant bit of the binary "remainder" variable.
79The other details of endianness have been hidden in CRCPOLY (which must
80be bit-reversed) and next_input_bit().
81
82As long as next_input_bit is returning the bits in a sensible order, we don't
83*have* to wait until the last possible moment to merge in additional bits.
84We can do it 8 bits at a time rather than 1 bit at a time:
85for (i = 0; i < input_bytes; i++) {
86 remainder ^= next_input_byte() << 24;
87 for (j = 0; j < 8; j++) {
88 multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
89 remainder = (remainder << 1) ^ multiple;
90 }
91}
92
93Or in little-endian:
94for (i = 0; i < input_bytes; i++) {
95 remainder ^= next_input_byte();
96 for (j = 0; j < 8; j++) {
97 multiple = (remainder & 1) ? CRCPOLY : 0;
98 remainder = (remainder >> 1) ^ multiple;
99 }
100}
101
102If the input is a multiple of 32 bits, you can even XOR in a 32-bit
103word at a time and increase the inner loop count to 32.
104
105You can also mix and match the two loop styles, for example doing the
106bulk of a message byte-at-a-time and adding bit-at-a-time processing
107for any fractional bytes at the end.
108
109To reduce the number of conditional branches, software commonly uses
110the byte-at-a-time table method, popularized by Dilip V. Sarwate,
111"Computation of Cyclic Redundancy Checks via Table Look-Up", Comm. ACM
112v.31 no.8 (August 1998) p. 1008-1013.
113
114Here, rather than just shifting one bit of the remainder to decide
115in the correct multiple to subtract, we can shift a byte at a time.
116This produces a 40-bit (rather than a 33-bit) intermediate remainder,
117and the correct multiple of the polynomial to subtract is found using
118a 256-entry lookup table indexed by the high 8 bits.
119
120(The table entries are simply the CRC-32 of the given one-byte messages.)
121
122When space is more constrained, smaller tables can be used, e.g. two
1234-bit shifts followed by a lookup in a 16-entry table.
124
125It is not practical to process much more than 8 bits at a time using this
126technique, because tables larger than 256 entries use too much memory and,
127more importantly, too much of the L1 cache.
128
129To get higher software performance, a "slicing" technique can be used.
130See "High Octane CRC Generation with the Intel Slicing-by-8 Algorithm",
131ftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf
132
133This does not change the number of table lookups, but does increase
134the parallelism. With the classic Sarwate algorithm, each table lookup
135must be completed before the index of the next can be computed.
136
137A "slicing by 2" technique would shift the remainder 16 bits at a time,
138producing a 48-bit intermediate remainder. Rather than doing a single
139lookup in a 65536-entry table, the two high bytes are looked up in
140two different 256-entry tables. Each contains the remainder required
141to cancel out the corresponding byte. The tables are different because the
142polynomials to cancel are different. One has non-zero coefficients from
143x^32 to x^39, while the other goes from x^40 to x^47.
144
145Since modern processors can handle many parallel memory operations, this
146takes barely longer than a single table look-up and thus performs almost
147twice as fast as the basic Sarwate algorithm.
148
149This can be extended to "slicing by 4" using 4 256-entry tables.
150Each step, 32 bits of data is fetched, XORed with the CRC, and the result
151broken into bytes and looked up in the tables. Because the 32-bit shift
152leaves the low-order bits of the intermediate remainder zero, the
153final CRC is simply the XOR of the 4 table look-ups.
154
155But this still enforces sequential execution: a second group of table
156look-ups cannot begin until the previous groups 4 table look-ups have all
157been completed. Thus, the processor's load/store unit is sometimes idle.
158
159To make maximum use of the processor, "slicing by 8" performs 8 look-ups
160in parallel. Each step, the 32-bit CRC is shifted 64 bits and XORed
161with 64 bits of input data. What is important to note is that 4 of
162those 8 bytes are simply copies of the input data; they do not depend
163on the previous CRC at all. Thus, those 4 table look-ups may commence
164immediately, without waiting for the previous loop iteration.
165
166By always having 4 loads in flight, a modern superscalar processor can
167be kept busy and make full use of its L1 cache.
168
169Two more details about CRC implementation in the real world:
170
171Normally, appending zero bits to a message which is already a multiple
172of a polynomial produces a larger multiple of that polynomial. Thus,
173a basic CRC will not detect appended zero bits (or bytes). To enable
174a CRC to detect this condition, it's common to invert the CRC before
175appending it. This makes the remainder of the message+crc come out not
176as zero, but some fixed non-zero value. (The CRC of the inversion
177pattern, 0xffffffff.)
178
179The same problem applies to zero bits prepended to the message, and a
180similar solution is used. Instead of starting the CRC computation with
181a remainder of 0, an initial remainder of all ones is used. As long as
182you start the same way on decoding, it doesn't make a difference.
diff --git a/lib/crc32.c b/lib/crc32.c
index ffea0c99a1f..c3ce94a06db 100644
--- a/lib/crc32.c
+++ b/lib/crc32.c
@@ -20,6 +20,8 @@
20 * Version 2. See the file COPYING for more details. 20 * Version 2. See the file COPYING for more details.
21 */ 21 */
22 22
23/* see: Documentation/crc32.txt for a description of algorithms */
24
23#include <linux/crc32.h> 25#include <linux/crc32.h>
24#include <linux/kernel.h> 26#include <linux/kernel.h>
25#include <linux/module.h> 27#include <linux/module.h>
@@ -209,133 +211,6 @@ u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len)
209EXPORT_SYMBOL(crc32_le); 211EXPORT_SYMBOL(crc32_le);
210EXPORT_SYMBOL(crc32_be); 212EXPORT_SYMBOL(crc32_be);
211 213
212/*
213 * A brief CRC tutorial.
214 *
215 * A CRC is a long-division remainder. You add the CRC to the message,
216 * and the whole thing (message+CRC) is a multiple of the given
217 * CRC polynomial. To check the CRC, you can either check that the
218 * CRC matches the recomputed value, *or* you can check that the
219 * remainder computed on the message+CRC is 0. This latter approach
220 * is used by a lot of hardware implementations, and is why so many
221 * protocols put the end-of-frame flag after the CRC.
222 *
223 * It's actually the same long division you learned in school, except that
224 * - We're working in binary, so the digits are only 0 and 1, and
225 * - When dividing polynomials, there are no carries. Rather than add and
226 * subtract, we just xor. Thus, we tend to get a bit sloppy about
227 * the difference between adding and subtracting.
228 *
229 * A 32-bit CRC polynomial is actually 33 bits long. But since it's
230 * 33 bits long, bit 32 is always going to be set, so usually the CRC
231 * is written in hex with the most significant bit omitted. (If you're
232 * familiar with the IEEE 754 floating-point format, it's the same idea.)
233 *
234 * Note that a CRC is computed over a string of *bits*, so you have
235 * to decide on the endianness of the bits within each byte. To get
236 * the best error-detecting properties, this should correspond to the
237 * order they're actually sent. For example, standard RS-232 serial is
238 * little-endian; the most significant bit (sometimes used for parity)
239 * is sent last. And when appending a CRC word to a message, you should
240 * do it in the right order, matching the endianness.
241 *
242 * Just like with ordinary division, the remainder is always smaller than
243 * the divisor (the CRC polynomial) you're dividing by. Each step of the
244 * division, you take one more digit (bit) of the dividend and append it
245 * to the current remainder. Then you figure out the appropriate multiple
246 * of the divisor to subtract to being the remainder back into range.
247 * In binary, it's easy - it has to be either 0 or 1, and to make the
248 * XOR cancel, it's just a copy of bit 32 of the remainder.
249 *
250 * When computing a CRC, we don't care about the quotient, so we can
251 * throw the quotient bit away, but subtract the appropriate multiple of
252 * the polynomial from the remainder and we're back to where we started,
253 * ready to process the next bit.
254 *
255 * A big-endian CRC written this way would be coded like:
256 * for (i = 0; i < input_bits; i++) {
257 * multiple = remainder & 0x80000000 ? CRCPOLY : 0;
258 * remainder = (remainder << 1 | next_input_bit()) ^ multiple;
259 * }
260 * Notice how, to get at bit 32 of the shifted remainder, we look
261 * at bit 31 of the remainder *before* shifting it.
262 *
263 * But also notice how the next_input_bit() bits we're shifting into
264 * the remainder don't actually affect any decision-making until
265 * 32 bits later. Thus, the first 32 cycles of this are pretty boring.
266 * Also, to add the CRC to a message, we need a 32-bit-long hole for it at
267 * the end, so we have to add 32 extra cycles shifting in zeros at the
268 * end of every message,
269 *
270 * So the standard trick is to rearrage merging in the next_input_bit()
271 * until the moment it's needed. Then the first 32 cycles can be precomputed,
272 * and merging in the final 32 zero bits to make room for the CRC can be
273 * skipped entirely.
274 * This changes the code to:
275 * for (i = 0; i < input_bits; i++) {
276 * remainder ^= next_input_bit() << 31;
277 * multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
278 * remainder = (remainder << 1) ^ multiple;
279 * }
280 * With this optimization, the little-endian code is simpler:
281 * for (i = 0; i < input_bits; i++) {
282 * remainder ^= next_input_bit();
283 * multiple = (remainder & 1) ? CRCPOLY : 0;
284 * remainder = (remainder >> 1) ^ multiple;
285 * }
286 *
287 * Note that the other details of endianness have been hidden in CRCPOLY
288 * (which must be bit-reversed) and next_input_bit().
289 *
290 * However, as long as next_input_bit is returning the bits in a sensible
291 * order, we can actually do the merging 8 or more bits at a time rather
292 * than one bit at a time:
293 * for (i = 0; i < input_bytes; i++) {
294 * remainder ^= next_input_byte() << 24;
295 * for (j = 0; j < 8; j++) {
296 * multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
297 * remainder = (remainder << 1) ^ multiple;
298 * }
299 * }
300 * Or in little-endian:
301 * for (i = 0; i < input_bytes; i++) {
302 * remainder ^= next_input_byte();
303 * for (j = 0; j < 8; j++) {
304 * multiple = (remainder & 1) ? CRCPOLY : 0;
305 * remainder = (remainder << 1) ^ multiple;
306 * }
307 * }
308 * If the input is a multiple of 32 bits, you can even XOR in a 32-bit
309 * word at a time and increase the inner loop count to 32.
310 *
311 * You can also mix and match the two loop styles, for example doing the
312 * bulk of a message byte-at-a-time and adding bit-at-a-time processing
313 * for any fractional bytes at the end.
314 *
315 * The only remaining optimization is to the byte-at-a-time table method.
316 * Here, rather than just shifting one bit of the remainder to decide
317 * in the correct multiple to subtract, we can shift a byte at a time.
318 * This produces a 40-bit (rather than a 33-bit) intermediate remainder,
319 * but again the multiple of the polynomial to subtract depends only on
320 * the high bits, the high 8 bits in this case.
321 *
322 * The multiple we need in that case is the low 32 bits of a 40-bit
323 * value whose high 8 bits are given, and which is a multiple of the
324 * generator polynomial. This is simply the CRC-32 of the given
325 * one-byte message.
326 *
327 * Two more details: normally, appending zero bits to a message which
328 * is already a multiple of a polynomial produces a larger multiple of that
329 * polynomial. To enable a CRC to detect this condition, it's common to
330 * invert the CRC before appending it. This makes the remainder of the
331 * message+crc come out not as zero, but some fixed non-zero value.
332 *
333 * The same problem applies to zero bits prepended to the message, and
334 * a similar solution is used. Instead of starting with a remainder of
335 * 0, an initial remainder of all ones is used. As long as you start
336 * the same way on decoding, it doesn't make a difference.
337 */
338
339#ifdef UNITTEST 214#ifdef UNITTEST
340 215
341#include <stdlib.h> 216#include <stdlib.h>