diff options
| author | frans <fransmeulenbroeks@gmail.com> | 2008-08-15 17:14:31 -0400 |
|---|---|---|
| committer | David Woodhouse <David.Woodhouse@intel.com> | 2008-08-16 05:55:33 -0400 |
| commit | e6cf5df1838c28bb060ac45b5585e48e71bbc740 (patch) | |
| tree | b1333e4664fce7dd3c58dd879192e085cb1c2066 /Documentation/mtd | |
| parent | 782b7a367d81da005d93b28cb00f9ae086773c24 (diff) | |
[MTD] [NAND] nand_ecc.c: rewrite for improved performance
This patch improves the performance of the ecc generation code by a
factor of 18 on an INTEL D920 CPU, a factor of 7 on MIPS and a factor of 5
on ARM (NSLU2)
Signed-off-by: Frans Meulenbroeks <fransmeulenbroeks@gmail.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Diffstat (limited to 'Documentation/mtd')
| -rw-r--r-- | Documentation/mtd/nand_ecc.txt | 714 |
1 files changed, 714 insertions, 0 deletions
diff --git a/Documentation/mtd/nand_ecc.txt b/Documentation/mtd/nand_ecc.txt new file mode 100644 index 000000000000..bdf93b7f0f24 --- /dev/null +++ b/Documentation/mtd/nand_ecc.txt | |||
| @@ -0,0 +1,714 @@ | |||
| 1 | Introduction | ||
| 2 | ============ | ||
| 3 | |||
| 4 | Having looked at the linux mtd/nand driver and more specific at nand_ecc.c | ||
| 5 | I felt there was room for optimisation. I bashed the code for a few hours | ||
| 6 | performing tricks like table lookup removing superfluous code etc. | ||
| 7 | After that the speed was increased by 35-40%. | ||
| 8 | Still I was not too happy as I felt there was additional room for improvement. | ||
| 9 | |||
| 10 | Bad! I was hooked. | ||
| 11 | I decided to annotate my steps in this file. Perhaps it is useful to someone | ||
| 12 | or someone learns something from it. | ||
| 13 | |||
| 14 | |||
| 15 | The problem | ||
| 16 | =========== | ||
| 17 | |||
| 18 | NAND flash (at least SLC one) typically has sectors of 256 bytes. | ||
| 19 | However NAND flash is not extremely reliable so some error detection | ||
| 20 | (and sometimes correction) is needed. | ||
| 21 | |||
| 22 | This is done by means of a Hamming code. I'll try to explain it in | ||
| 23 | laymans terms (and apologies to all the pro's in the field in case I do | ||
| 24 | not use the right terminology, my coding theory class was almost 30 | ||
| 25 | years ago, and I must admit it was not one of my favourites). | ||
| 26 | |||
| 27 | As I said before the ecc calculation is performed on sectors of 256 | ||
| 28 | bytes. This is done by calculating several parity bits over the rows and | ||
| 29 | columns. The parity used is even parity which means that the parity bit = 1 | ||
| 30 | if the data over which the parity is calculated is 1 and the parity bit = 0 | ||
| 31 | if the data over which the parity is calculated is 0. So the total | ||
| 32 | number of bits over the data over which the parity is calculated + the | ||
| 33 | parity bit is even. (see wikipedia if you can't follow this). | ||
| 34 | Parity is often calculated by means of an exclusive or operation, | ||
| 35 | sometimes also referred to as xor. In C the operator for xor is ^ | ||
| 36 | |||
| 37 | Back to ecc. | ||
| 38 | Let's give a small figure: | ||
| 39 | |||
| 40 | byte 0: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp4 ... rp14 | ||
| 41 | byte 1: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp2 rp4 ... rp14 | ||
| 42 | byte 2: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp4 ... rp14 | ||
| 43 | byte 3: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp4 ... rp14 | ||
| 44 | byte 4: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp5 ... rp14 | ||
| 45 | .... | ||
| 46 | byte 254: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp5 ... rp15 | ||
| 47 | byte 255: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp5 ... rp15 | ||
| 48 | cp1 cp0 cp1 cp0 cp1 cp0 cp1 cp0 | ||
| 49 | cp3 cp3 cp2 cp2 cp3 cp3 cp2 cp2 | ||
| 50 | cp5 cp5 cp5 cp5 cp4 cp4 cp4 cp4 | ||
| 51 | |||
| 52 | This figure represents a sector of 256 bytes. | ||
| 53 | cp is my abbreviaton for column parity, rp for row parity. | ||
| 54 | |||
| 55 | Let's start to explain column parity. | ||
| 56 | cp0 is the parity that belongs to all bit0, bit2, bit4, bit6. | ||
| 57 | so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even. | ||
| 58 | Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7. | ||
| 59 | cp2 is the parity over bit0, bit1, bit4 and bit5 | ||
| 60 | cp3 is the parity over bit2, bit3, bit6 and bit7. | ||
| 61 | cp4 is the parity over bit0, bit1, bit2 and bit3. | ||
| 62 | cp5 is the parity over bit4, bit5, bit6 and bit7. | ||
| 63 | Note that each of cp0 .. cp5 is exactly one bit. | ||
| 64 | |||
| 65 | Row parity actually works almost the same. | ||
| 66 | rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254) | ||
| 67 | rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255) | ||
| 68 | rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ... | ||
| 69 | (so handle two bytes, then skip 2 bytes). | ||
| 70 | rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...) | ||
| 71 | for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc. | ||
| 72 | so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...) | ||
| 73 | and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, .. | ||
| 74 | The story now becomes quite boring. I guess you get the idea. | ||
| 75 | rp6 covers 8 bytes then skips 8 etc | ||
| 76 | rp7 skips 8 bytes then covers 8 etc | ||
| 77 | rp8 covers 16 bytes then skips 16 etc | ||
| 78 | rp9 skips 16 bytes then covers 16 etc | ||
| 79 | rp10 covers 32 bytes then skips 32 etc | ||
| 80 | rp11 skips 32 bytes then covers 32 etc | ||
| 81 | rp12 covers 64 bytes then skips 64 etc | ||
| 82 | rp13 skips 64 bytes then covers 64 etc | ||
| 83 | rp14 covers 128 bytes then skips 128 | ||
| 84 | rp15 skips 128 bytes then covers 128 | ||
| 85 | |||
| 86 | In the end the parity bits are grouped together in three bytes as | ||
| 87 | follows: | ||
| 88 | ECC Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0 | ||
| 89 | ECC 0 rp07 rp06 rp05 rp04 rp03 rp02 rp01 rp00 | ||
| 90 | ECC 1 rp15 rp14 rp13 rp12 rp11 rp10 rp09 rp08 | ||
| 91 | ECC 2 cp5 cp4 cp3 cp2 cp1 cp0 1 1 | ||
| 92 | |||
| 93 | I detected after writing this that ST application note AN1823 | ||
| 94 | (http://www.st.com/stonline/books/pdf/docs/10123.pdf) gives a much | ||
| 95 | nicer picture.(but they use line parity as term where I use row parity) | ||
| 96 | Oh well, I'm graphically challenged, so suffer with me for a moment :-) | ||
| 97 | And I could not reuse the ST picture anyway for copyright reasons. | ||
| 98 | |||
| 99 | |||
| 100 | Attempt 0 | ||
| 101 | ========= | ||
| 102 | |||
| 103 | Implementing the parity calculation is pretty simple. | ||
| 104 | In C pseudocode: | ||
| 105 | for (i = 0; i < 256; i++) | ||
| 106 | { | ||
| 107 | if (i & 0x01) | ||
| 108 | rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; | ||
| 109 | else | ||
| 110 | rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; | ||
| 111 | if (i & 0x02) | ||
| 112 | rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3; | ||
| 113 | else | ||
| 114 | rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2; | ||
| 115 | if (i & 0x04) | ||
| 116 | rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5; | ||
| 117 | else | ||
| 118 | rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4; | ||
| 119 | if (i & 0x08) | ||
| 120 | rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7; | ||
| 121 | else | ||
| 122 | rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6; | ||
| 123 | if (i & 0x10) | ||
| 124 | rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9; | ||
| 125 | else | ||
| 126 | rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8; | ||
| 127 | if (i & 0x20) | ||
| 128 | rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11; | ||
| 129 | else | ||
| 130 | rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10; | ||
| 131 | if (i & 0x40) | ||
| 132 | rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13; | ||
| 133 | else | ||
| 134 | rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12; | ||
| 135 | if (i & 0x80) | ||
| 136 | rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15; | ||
| 137 | else | ||
| 138 | rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14; | ||
| 139 | cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0; | ||
| 140 | cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1; | ||
| 141 | cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2; | ||
| 142 | cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3 | ||
| 143 | cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4 | ||
| 144 | cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5 | ||
| 145 | } | ||
| 146 | |||
| 147 | |||
| 148 | Analysis 0 | ||
| 149 | ========== | ||
| 150 | |||
| 151 | C does have bitwise operators but not really operators to do the above | ||
| 152 | efficiently (and most hardware has no such instructions either). | ||
| 153 | Therefore without implementing this it was clear that the code above was | ||
| 154 | not going to bring me a Nobel prize :-) | ||
| 155 | |||
| 156 | Fortunately the exclusive or operation is commutative, so we can combine | ||
| 157 | the values in any order. So instead of calculating all the bits | ||
| 158 | individually, let us try to rearrange things. | ||
| 159 | For the column parity this is easy. We can just xor the bytes and in the | ||
| 160 | end filter out the relevant bits. This is pretty nice as it will bring | ||
| 161 | all cp calculation out of the if loop. | ||
| 162 | |||
| 163 | Similarly we can first xor the bytes for the various rows. | ||
| 164 | This leads to: | ||
| 165 | |||
| 166 | |||
| 167 | Attempt 1 | ||
| 168 | ========= | ||
| 169 | |||
| 170 | const char parity[256] = { | ||
| 171 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
| 172 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
| 173 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
| 174 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
| 175 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
| 176 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
| 177 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
| 178 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
| 179 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
| 180 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
| 181 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
| 182 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
| 183 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
| 184 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
| 185 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
| 186 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0 | ||
| 187 | }; | ||
| 188 | |||
| 189 | void ecc1(const unsigned char *buf, unsigned char *code) | ||
| 190 | { | ||
| 191 | int i; | ||
| 192 | const unsigned char *bp = buf; | ||
| 193 | unsigned char cur; | ||
| 194 | unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; | ||
| 195 | unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; | ||
| 196 | unsigned char par; | ||
| 197 | |||
| 198 | par = 0; | ||
| 199 | rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; | ||
| 200 | rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; | ||
| 201 | rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; | ||
| 202 | rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; | ||
| 203 | |||
| 204 | for (i = 0; i < 256; i++) | ||
| 205 | { | ||
| 206 | cur = *bp++; | ||
| 207 | par ^= cur; | ||
| 208 | if (i & 0x01) rp1 ^= cur; else rp0 ^= cur; | ||
| 209 | if (i & 0x02) rp3 ^= cur; else rp2 ^= cur; | ||
| 210 | if (i & 0x04) rp5 ^= cur; else rp4 ^= cur; | ||
| 211 | if (i & 0x08) rp7 ^= cur; else rp6 ^= cur; | ||
| 212 | if (i & 0x10) rp9 ^= cur; else rp8 ^= cur; | ||
| 213 | if (i & 0x20) rp11 ^= cur; else rp10 ^= cur; | ||
| 214 | if (i & 0x40) rp13 ^= cur; else rp12 ^= cur; | ||
| 215 | if (i & 0x80) rp15 ^= cur; else rp14 ^= cur; | ||
| 216 | } | ||
| 217 | code[0] = | ||
| 218 | (parity[rp7] << 7) | | ||
| 219 | (parity[rp6] << 6) | | ||
| 220 | (parity[rp5] << 5) | | ||
| 221 | (parity[rp4] << 4) | | ||
| 222 | (parity[rp3] << 3) | | ||
| 223 | (parity[rp2] << 2) | | ||
| 224 | (parity[rp1] << 1) | | ||
| 225 | (parity[rp0]); | ||
| 226 | code[1] = | ||
| 227 | (parity[rp15] << 7) | | ||
| 228 | (parity[rp14] << 6) | | ||
| 229 | (parity[rp13] << 5) | | ||
| 230 | (parity[rp12] << 4) | | ||
| 231 | (parity[rp11] << 3) | | ||
| 232 | (parity[rp10] << 2) | | ||
| 233 | (parity[rp9] << 1) | | ||
| 234 | (parity[rp8]); | ||
| 235 | code[2] = | ||
| 236 | (parity[par & 0xf0] << 7) | | ||
| 237 | (parity[par & 0x0f] << 6) | | ||
| 238 | (parity[par & 0xcc] << 5) | | ||
| 239 | (parity[par & 0x33] << 4) | | ||
| 240 | (parity[par & 0xaa] << 3) | | ||
| 241 | (parity[par & 0x55] << 2); | ||
| 242 | code[0] = ~code[0]; | ||
| 243 | code[1] = ~code[1]; | ||
| 244 | code[2] = ~code[2]; | ||
| 245 | } | ||
| 246 | |||
| 247 | Still pretty straightforward. The last three invert statements are there to | ||
| 248 | give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash | ||
| 249 | all data is 0xff, so the checksum then matches. | ||
| 250 | |||
| 251 | I also introduced the parity lookup. I expected this to be the fastest | ||
| 252 | way to calculate the parity, but I will investigate alternatives later | ||
| 253 | on. | ||
| 254 | |||
| 255 | |||
| 256 | Analysis 1 | ||
| 257 | ========== | ||
| 258 | |||
| 259 | The code works, but is not terribly efficient. On my system it took | ||
| 260 | almost 4 times as much time as the linux driver code. But hey, if it was | ||
| 261 | *that* easy this would have been done long before. | ||
| 262 | No pain. no gain. | ||
| 263 | |||
| 264 | Fortunately there is plenty of room for improvement. | ||
| 265 | |||
| 266 | In step 1 we moved from bit-wise calculation to byte-wise calculation. | ||
| 267 | However in C we can also use the unsigned long data type and virtually | ||
| 268 | every modern microprocessor supports 32 bit operations, so why not try | ||
| 269 | to write our code in such a way that we process data in 32 bit chunks. | ||
| 270 | |||
| 271 | Of course this means some modification as the row parity is byte by | ||
| 272 | byte. A quick analysis: | ||
| 273 | for the column parity we use the par variable. When extending to 32 bits | ||
| 274 | we can in the end easily calculate p0 and p1 from it. | ||
| 275 | (because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0 | ||
| 276 | respectively) | ||
| 277 | also rp2 and rp3 can be easily retrieved from par as rp3 covers the | ||
| 278 | first two bytes and rp2 the last two bytes. | ||
| 279 | |||
| 280 | Note that of course now the loop is executed only 64 times (256/4). | ||
| 281 | And note that care must taken wrt byte ordering. The way bytes are | ||
| 282 | ordered in a long is machine dependent, and might affect us. | ||
| 283 | Anyway, if there is an issue: this code is developed on x86 (to be | ||
| 284 | precise: a DELL PC with a D920 Intel CPU) | ||
| 285 | |||
| 286 | And of course the performance might depend on alignment, but I expect | ||
| 287 | that the I/O buffers in the nand driver are aligned properly (and | ||
| 288 | otherwise that should be fixed to get maximum performance). | ||
| 289 | |||
| 290 | Let's give it a try... | ||
| 291 | |||
| 292 | |||
| 293 | Attempt 2 | ||
| 294 | ========= | ||
| 295 | |||
| 296 | extern const char parity[256]; | ||
| 297 | |||
| 298 | void ecc2(const unsigned char *buf, unsigned char *code) | ||
| 299 | { | ||
| 300 | int i; | ||
| 301 | const unsigned long *bp = (unsigned long *)buf; | ||
| 302 | unsigned long cur; | ||
| 303 | unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; | ||
| 304 | unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; | ||
| 305 | unsigned long par; | ||
| 306 | |||
| 307 | par = 0; | ||
| 308 | rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; | ||
| 309 | rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; | ||
| 310 | rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; | ||
| 311 | rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; | ||
| 312 | |||
| 313 | for (i = 0; i < 64; i++) | ||
| 314 | { | ||
| 315 | cur = *bp++; | ||
| 316 | par ^= cur; | ||
| 317 | if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; | ||
| 318 | if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; | ||
| 319 | if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; | ||
| 320 | if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; | ||
| 321 | if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; | ||
| 322 | if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; | ||
| 323 | } | ||
| 324 | /* | ||
| 325 | we need to adapt the code generation for the fact that rp vars are now | ||
| 326 | long; also the column parity calculation needs to be changed. | ||
| 327 | we'll bring rp4 to 15 back to single byte entities by shifting and | ||
| 328 | xoring | ||
| 329 | */ | ||
| 330 | rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff; | ||
| 331 | rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff; | ||
| 332 | rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff; | ||
| 333 | rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff; | ||
| 334 | rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff; | ||
| 335 | rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff; | ||
| 336 | rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff; | ||
| 337 | rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff; | ||
| 338 | rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff; | ||
| 339 | rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff; | ||
| 340 | rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff; | ||
| 341 | rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff; | ||
| 342 | rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff; | ||
| 343 | rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff; | ||
| 344 | par ^= (par >> 16); | ||
| 345 | rp1 = (par >> 8); rp1 &= 0xff; | ||
| 346 | rp0 = (par & 0xff); | ||
| 347 | par ^= (par >> 8); par &= 0xff; | ||
| 348 | |||
| 349 | code[0] = | ||
| 350 | (parity[rp7] << 7) | | ||
| 351 | (parity[rp6] << 6) | | ||
| 352 | (parity[rp5] << 5) | | ||
| 353 | (parity[rp4] << 4) | | ||
| 354 | (parity[rp3] << 3) | | ||
| 355 | (parity[rp2] << 2) | | ||
| 356 | (parity[rp1] << 1) | | ||
| 357 | (parity[rp0]); | ||
| 358 | code[1] = | ||
| 359 | (parity[rp15] << 7) | | ||
| 360 | (parity[rp14] << 6) | | ||
| 361 | (parity[rp13] << 5) | | ||
| 362 | (parity[rp12] << 4) | | ||
| 363 | (parity[rp11] << 3) | | ||
| 364 | (parity[rp10] << 2) | | ||
| 365 | (parity[rp9] << 1) | | ||
| 366 | (parity[rp8]); | ||
| 367 | code[2] = | ||
| 368 | (parity[par & 0xf0] << 7) | | ||
| 369 | (parity[par & 0x0f] << 6) | | ||
| 370 | (parity[par & 0xcc] << 5) | | ||
| 371 | (parity[par & 0x33] << 4) | | ||
| 372 | (parity[par & 0xaa] << 3) | | ||
| 373 | (parity[par & 0x55] << 2); | ||
| 374 | code[0] = ~code[0]; | ||
| 375 | code[1] = ~code[1]; | ||
| 376 | code[2] = ~code[2]; | ||
| 377 | } | ||
| 378 | |||
| 379 | The parity array is not shown any more. Note also that for these | ||
| 380 | examples I kinda deviated from my regular programming style by allowing | ||
| 381 | multiple statements on a line, not using { } in then and else blocks | ||
| 382 | with only a single statement and by using operators like ^= | ||
| 383 | |||
| 384 | |||
| 385 | Analysis 2 | ||
| 386 | ========== | ||
| 387 | |||
| 388 | The code (of course) works, and hurray: we are a little bit faster than | ||
| 389 | the linux driver code (about 15%). But wait, don't cheer too quickly. | ||
| 390 | THere is more to be gained. | ||
| 391 | If we look at e.g. rp14 and rp15 we see that we either xor our data with | ||
| 392 | rp14 or with rp15. However we also have par which goes over all data. | ||
| 393 | This means there is no need to calculate rp14 as it can be calculated from | ||
| 394 | rp15 through rp14 = par ^ rp15; | ||
| 395 | (or if desired we can avoid calculating rp15 and calculate it from | ||
| 396 | rp14). That is why some places refer to inverse parity. | ||
| 397 | Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13. | ||
| 398 | Effectively this means we can eliminate the else clause from the if | ||
| 399 | statements. Also we can optimise the calculation in the end a little bit | ||
| 400 | by going from long to byte first. Actually we can even avoid the table | ||
| 401 | lookups | ||
| 402 | |||
| 403 | Attempt 3 | ||
| 404 | ========= | ||
| 405 | |||
| 406 | Odd replaced: | ||
| 407 | if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; | ||
| 408 | if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; | ||
| 409 | if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; | ||
| 410 | if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; | ||
| 411 | if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; | ||
| 412 | if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; | ||
| 413 | with | ||
| 414 | if (i & 0x01) rp5 ^= cur; | ||
| 415 | if (i & 0x02) rp7 ^= cur; | ||
| 416 | if (i & 0x04) rp9 ^= cur; | ||
| 417 | if (i & 0x08) rp11 ^= cur; | ||
| 418 | if (i & 0x10) rp13 ^= cur; | ||
| 419 | if (i & 0x20) rp15 ^= cur; | ||
| 420 | |||
| 421 | and outside the loop added: | ||
| 422 | rp4 = par ^ rp5; | ||
| 423 | rp6 = par ^ rp7; | ||
| 424 | rp8 = par ^ rp9; | ||
| 425 | rp10 = par ^ rp11; | ||
| 426 | rp12 = par ^ rp13; | ||
| 427 | rp14 = par ^ rp15; | ||
| 428 | |||
| 429 | And after that the code takes about 30% more time, although the number of | ||
| 430 | statements is reduced. This is also reflected in the assembly code. | ||
| 431 | |||
| 432 | |||
| 433 | Analysis 3 | ||
| 434 | ========== | ||
| 435 | |||
| 436 | Very weird. Guess it has to do with caching or instruction parallellism | ||
| 437 | or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting | ||
| 438 | observation was that this one is only 30% slower (according to time) | ||
| 439 | executing the code as my 3Ghz D920 processor. | ||
| 440 | |||
| 441 | Well, it was expected not to be easy so maybe instead move to a | ||
| 442 | different track: let's move back to the code from attempt2 and do some | ||
| 443 | loop unrolling. This will eliminate a few if statements. I'll try | ||
| 444 | different amounts of unrolling to see what works best. | ||
| 445 | |||
| 446 | |||
| 447 | Attempt 4 | ||
| 448 | ========= | ||
| 449 | |||
| 450 | Unrolled the loop 1, 2, 3 and 4 times. | ||
| 451 | For 4 the code starts with: | ||
| 452 | |||
| 453 | for (i = 0; i < 4; i++) | ||
| 454 | { | ||
| 455 | cur = *bp++; | ||
| 456 | par ^= cur; | ||
| 457 | rp4 ^= cur; | ||
| 458 | rp6 ^= cur; | ||
| 459 | rp8 ^= cur; | ||
| 460 | rp10 ^= cur; | ||
| 461 | if (i & 0x1) rp13 ^= cur; else rp12 ^= cur; | ||
| 462 | if (i & 0x2) rp15 ^= cur; else rp14 ^= cur; | ||
| 463 | cur = *bp++; | ||
| 464 | par ^= cur; | ||
| 465 | rp5 ^= cur; | ||
| 466 | rp6 ^= cur; | ||
| 467 | ... | ||
| 468 | |||
| 469 | |||
| 470 | Analysis 4 | ||
| 471 | ========== | ||
| 472 | |||
| 473 | Unrolling once gains about 15% | ||
| 474 | Unrolling twice keeps the gain at about 15% | ||
| 475 | Unrolling three times gives a gain of 30% compared to attempt 2. | ||
| 476 | Unrolling four times gives a marginal improvement compared to unrolling | ||
| 477 | three times. | ||
| 478 | |||
| 479 | I decided to proceed with a four time unrolled loop anyway. It was my gut | ||
| 480 | feeling that in the next steps I would obtain additional gain from it. | ||
| 481 | |||
| 482 | The next step was triggered by the fact that par contains the xor of all | ||
| 483 | bytes and rp4 and rp5 each contain the xor of half of the bytes. | ||
| 484 | So in effect par = rp4 ^ rp5. But as xor is commutative we can also say | ||
| 485 | that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can | ||
| 486 | eliminate rp5 (or rp4, but I already foresaw another optimisation). | ||
| 487 | The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15. | ||
| 488 | |||
| 489 | |||
| 490 | Attempt 5 | ||
| 491 | ========= | ||
| 492 | |||
| 493 | Effectively so all odd digit rp assignments in the loop were removed. | ||
| 494 | This included the else clause of the if statements. | ||
| 495 | Of course after the loop we need to correct things by adding code like: | ||
| 496 | rp5 = par ^ rp4; | ||
| 497 | Also the initial assignments (rp5 = 0; etc) could be removed. | ||
| 498 | Along the line I also removed the initialisation of rp0/1/2/3. | ||
| 499 | |||
| 500 | |||
| 501 | Analysis 5 | ||
| 502 | ========== | ||
| 503 | |||
| 504 | Measurements showed this was a good move. The run-time roughly halved | ||
| 505 | compared with attempt 4 with 4 times unrolled, and we only require 1/3rd | ||
| 506 | of the processor time compared to the current code in the linux kernel. | ||
| 507 | |||
| 508 | However, still I thought there was more. I didn't like all the if | ||
| 509 | statements. Why not keep a running parity and only keep the last if | ||
| 510 | statement. Time for yet another version! | ||
| 511 | |||
| 512 | |||
| 513 | Attempt 6 | ||
| 514 | ========= | ||
| 515 | |||
| 516 | THe code within the for loop was changed to: | ||
| 517 | |||
| 518 | for (i = 0; i < 4; i++) | ||
| 519 | { | ||
| 520 | cur = *bp++; tmppar = cur; rp4 ^= cur; | ||
| 521 | cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; | ||
| 522 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
| 523 | cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; | ||
| 524 | |||
| 525 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; | ||
| 526 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
| 527 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
| 528 | cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; | ||
| 529 | |||
| 530 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur; | ||
| 531 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur; | ||
| 532 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur; | ||
| 533 | cur = *bp++; tmppar ^= cur; rp8 ^= cur; | ||
| 534 | |||
| 535 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; | ||
| 536 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
| 537 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
| 538 | cur = *bp++; tmppar ^= cur; | ||
| 539 | |||
| 540 | par ^= tmppar; | ||
| 541 | if ((i & 0x1) == 0) rp12 ^= tmppar; | ||
| 542 | if ((i & 0x2) == 0) rp14 ^= tmppar; | ||
| 543 | } | ||
| 544 | |||
| 545 | As you can see tmppar is used to accumulate the parity within a for | ||
| 546 | iteration. In the last 3 statements is is added to par and, if needed, | ||
| 547 | to rp12 and rp14. | ||
| 548 | |||
| 549 | While making the changes I also found that I could exploit that tmppar | ||
| 550 | contains the running parity for this iteration. So instead of having: | ||
| 551 | rp4 ^= cur; rp6 = cur; | ||
| 552 | I removed the rp6 = cur; statement and did rp6 ^= tmppar; on next | ||
| 553 | statement. A similar change was done for rp8 and rp10 | ||
| 554 | |||
| 555 | |||
| 556 | Analysis 6 | ||
| 557 | ========== | ||
| 558 | |||
| 559 | Measuring this code again showed big gain. When executing the original | ||
| 560 | linux code 1 million times, this took about 1 second on my system. | ||
| 561 | (using time to measure the performance). After this iteration I was back | ||
| 562 | to 0.075 sec. Actually I had to decide to start measuring over 10 | ||
| 563 | million interations in order not to loose too much accuracy. This one | ||
| 564 | definitely seemed to be the jackpot! | ||
| 565 | |||
| 566 | There is a little bit more room for improvement though. There are three | ||
| 567 | places with statements: | ||
| 568 | rp4 ^= cur; rp6 ^= cur; | ||
| 569 | It seems more efficient to also maintain a variable rp4_6 in the while | ||
| 570 | loop; This eliminates 3 statements per loop. Of course after the loop we | ||
| 571 | need to correct by adding: | ||
| 572 | rp4 ^= rp4_6; | ||
| 573 | rp6 ^= rp4_6 | ||
| 574 | Furthermore there are 4 sequential assingments to rp8. This can be | ||
| 575 | encoded slightly more efficient by saving tmppar before those 4 lines | ||
| 576 | and later do rp8 = rp8 ^ tmppar ^ notrp8; | ||
| 577 | (where notrp8 is the value of rp8 before those 4 lines). | ||
| 578 | Again a use of the commutative property of xor. | ||
| 579 | Time for a new test! | ||
| 580 | |||
| 581 | |||
| 582 | Attempt 7 | ||
| 583 | ========= | ||
| 584 | |||
| 585 | The new code now looks like: | ||
| 586 | |||
| 587 | for (i = 0; i < 4; i++) | ||
| 588 | { | ||
| 589 | cur = *bp++; tmppar = cur; rp4 ^= cur; | ||
| 590 | cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; | ||
| 591 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
| 592 | cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; | ||
| 593 | |||
| 594 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | ||
| 595 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
| 596 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
| 597 | cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; | ||
| 598 | |||
| 599 | notrp8 = tmppar; | ||
| 600 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | ||
| 601 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
| 602 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
| 603 | cur = *bp++; tmppar ^= cur; | ||
| 604 | rp8 = rp8 ^ tmppar ^ notrp8; | ||
| 605 | |||
| 606 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | ||
| 607 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
| 608 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
| 609 | cur = *bp++; tmppar ^= cur; | ||
| 610 | |||
| 611 | par ^= tmppar; | ||
| 612 | if ((i & 0x1) == 0) rp12 ^= tmppar; | ||
| 613 | if ((i & 0x2) == 0) rp14 ^= tmppar; | ||
| 614 | } | ||
| 615 | rp4 ^= rp4_6; | ||
| 616 | rp6 ^= rp4_6; | ||
| 617 | |||
| 618 | |||
| 619 | Not a big change, but every penny counts :-) | ||
| 620 | |||
| 621 | |||
| 622 | Analysis 7 | ||
| 623 | ========== | ||
| 624 | |||
| 625 | Acutally this made things worse. Not very much, but I don't want to move | ||
| 626 | into the wrong direction. Maybe something to investigate later. Could | ||
| 627 | have to do with caching again. | ||
| 628 | |||
| 629 | Guess that is what there is to win within the loop. Maybe unrolling one | ||
| 630 | more time will help. I'll keep the optimisations from 7 for now. | ||
| 631 | |||
| 632 | |||
| 633 | Attempt 8 | ||
| 634 | ========= | ||
| 635 | |||
| 636 | Unrolled the loop one more time. | ||
| 637 | |||
| 638 | |||
| 639 | Analysis 8 | ||
| 640 | ========== | ||
| 641 | |||
| 642 | This makes things worse. Let's stick with attempt 6 and continue from there. | ||
| 643 | Although it seems that the code within the loop cannot be optimised | ||
| 644 | further there is still room to optimize the generation of the ecc codes. | ||
| 645 | We can simply calcualate the total parity. If this is 0 then rp4 = rp5 | ||
| 646 | etc. If the parity is 1, then rp4 = !rp5; | ||
| 647 | But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits | ||
| 648 | in the result byte and then do something like | ||
| 649 | code[0] |= (code[0] << 1); | ||
| 650 | Lets test this. | ||
| 651 | |||
| 652 | |||
| 653 | Attempt 9 | ||
| 654 | ========= | ||
| 655 | |||
| 656 | Changed the code but again this slightly degrades performance. Tried all | ||
| 657 | kind of other things, like having dedicated parity arrays to avoid the | ||
| 658 | shift after parity[rp7] << 7; No gain. | ||
| 659 | Change the lookup using the parity array by using shift operators (e.g. | ||
| 660 | replace parity[rp7] << 7 with: | ||
| 661 | rp7 ^= (rp7 << 4); | ||
| 662 | rp7 ^= (rp7 << 2); | ||
| 663 | rp7 ^= (rp7 << 1); | ||
| 664 | rp7 &= 0x80; | ||
| 665 | No gain. | ||
| 666 | |||
| 667 | The only marginal change was inverting the parity bits, so we can remove | ||
| 668 | the last three invert statements. | ||
| 669 | |||
| 670 | Ah well, pity this does not deliver more. Then again 10 million | ||
| 671 | iterations using the linux driver code takes between 13 and 13.5 | ||
| 672 | seconds, whereas my code now takes about 0.73 seconds for those 10 | ||
| 673 | million iterations. So basically I've improved the performance by a | ||
| 674 | factor 18 on my system. Not that bad. Of course on different hardware | ||
| 675 | you will get different results. No warranties! | ||
| 676 | |||
| 677 | But of course there is no such thing as a free lunch. The codesize almost | ||
| 678 | tripled (from 562 bytes to 1434 bytes). Then again, it is not that much. | ||
| 679 | |||
| 680 | |||
| 681 | Correcting errors | ||
| 682 | ================= | ||
| 683 | |||
| 684 | For correcting errors I again used the ST application note as a starter, | ||
| 685 | but I also peeked at the existing code. | ||
| 686 | The algorithm itself is pretty straightforward. Just xor the given and | ||
| 687 | the calculated ecc. If all bytes are 0 there is no problem. If 11 bits | ||
| 688 | are 1 we have one correctable bit error. If there is 1 bit 1, we have an | ||
| 689 | error in the given ecc code. | ||
| 690 | It proved to be fastest to do some table lookups. Performance gain | ||
| 691 | introduced by this is about a factor 2 on my system when a repair had to | ||
| 692 | be done, and 1% or so if no repair had to be done. | ||
| 693 | Code size increased from 330 bytes to 686 bytes for this function. | ||
| 694 | (gcc 4.2, -O3) | ||
| 695 | |||
| 696 | |||
| 697 | Conclusion | ||
| 698 | ========== | ||
| 699 | |||
| 700 | The gain when calculating the ecc is tremendous. Om my development hardware | ||
| 701 | a speedup of a factor of 18 for ecc calculation was achieved. On a test on an | ||
| 702 | embedded system with a MIPS core a factor 7 was obtained. | ||
| 703 | On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor | ||
| 704 | 5 (big endian mode, gcc 4.1.2, -O3) | ||
| 705 | For correction not much gain could be obtained (as bitflips are rare). Then | ||
| 706 | again there are also much less cycles spent there. | ||
| 707 | |||
| 708 | It seems there is not much more gain possible in this, at least when | ||
| 709 | programmed in C. Of course it might be possible to squeeze something more | ||
| 710 | out of it with an assembler program, but due to pipeline behaviour etc | ||
| 711 | this is very tricky (at least for intel hw). | ||
| 712 | |||
| 713 | Author: Frans Meulenbroeks | ||
| 714 | Copyright (C) 2008 Koninklijke Philips Electronics NV. | ||
