diff options
author | Linus Torvalds <torvalds@ppc970.osdl.org> | 2005-04-16 18:20:36 -0400 |
---|---|---|
committer | Linus Torvalds <torvalds@ppc970.osdl.org> | 2005-04-16 18:20:36 -0400 |
commit | 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 (patch) | |
tree | 0bba044c4ce775e45a88a51686b5d9f90697ea9d /Documentation/crypto/descore-readme.txt |
Linux-2.6.12-rc2v2.6.12-rc2
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!
Diffstat (limited to 'Documentation/crypto/descore-readme.txt')
-rw-r--r-- | Documentation/crypto/descore-readme.txt | 352 |
1 files changed, 352 insertions, 0 deletions
diff --git a/Documentation/crypto/descore-readme.txt b/Documentation/crypto/descore-readme.txt new file mode 100644 index 000000000000..166474c2ee0b --- /dev/null +++ b/Documentation/crypto/descore-readme.txt | |||
@@ -0,0 +1,352 @@ | |||
1 | Below is the orginal README file from the descore.shar package. | ||
2 | ------------------------------------------------------------------------------ | ||
3 | |||
4 | des - fast & portable DES encryption & decryption. | ||
5 | Copyright (C) 1992 Dana L. How | ||
6 | |||
7 | This program is free software; you can redistribute it and/or modify | ||
8 | it under the terms of the GNU Library General Public License as published by | ||
9 | the Free Software Foundation; either version 2 of the License, or | ||
10 | (at your option) any later version. | ||
11 | |||
12 | This program is distributed in the hope that it will be useful, | ||
13 | but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
14 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
15 | GNU Library General Public License for more details. | ||
16 | |||
17 | You should have received a copy of the GNU Library General Public License | ||
18 | along with this program; if not, write to the Free Software | ||
19 | Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. | ||
20 | |||
21 | Author's address: how@isl.stanford.edu | ||
22 | |||
23 | $Id: README,v 1.15 1992/05/20 00:25:32 how E $ | ||
24 | |||
25 | |||
26 | ==>> To compile after untarring/unsharring, just `make' <<== | ||
27 | |||
28 | |||
29 | This package was designed with the following goals: | ||
30 | 1. Highest possible encryption/decryption PERFORMANCE. | ||
31 | 2. PORTABILITY to any byte-addressable host with a 32bit unsigned C type | ||
32 | 3. Plug-compatible replacement for KERBEROS's low-level routines. | ||
33 | |||
34 | This second release includes a number of performance enhancements for | ||
35 | register-starved machines. My discussions with Richard Outerbridge, | ||
36 | 71755.204@compuserve.com, sparked a number of these enhancements. | ||
37 | |||
38 | To more rapidly understand the code in this package, inspect desSmallFips.i | ||
39 | (created by typing `make') BEFORE you tackle desCode.h. The latter is set | ||
40 | up in a parameterized fashion so it can easily be modified by speed-daemon | ||
41 | hackers in pursuit of that last microsecond. You will find it more | ||
42 | illuminating to inspect one specific implementation, | ||
43 | and then move on to the common abstract skeleton with this one in mind. | ||
44 | |||
45 | |||
46 | performance comparison to other available des code which i could | ||
47 | compile on a SPARCStation 1 (cc -O4, gcc -O2): | ||
48 | |||
49 | this code (byte-order independent): | ||
50 | 30us per encryption (options: 64k tables, no IP/FP) | ||
51 | 33us per encryption (options: 64k tables, FIPS standard bit ordering) | ||
52 | 45us per encryption (options: 2k tables, no IP/FP) | ||
53 | 48us per encryption (options: 2k tables, FIPS standard bit ordering) | ||
54 | 275us to set a new key (uses 1k of key tables) | ||
55 | this has the quickest encryption/decryption routines i've seen. | ||
56 | since i was interested in fast des filters rather than crypt(3) | ||
57 | and password cracking, i haven't really bothered yet to speed up | ||
58 | the key setting routine. also, i have no interest in re-implementing | ||
59 | all the other junk in the mit kerberos des library, so i've just | ||
60 | provided my routines with little stub interfaces so they can be | ||
61 | used as drop-in replacements with mit's code or any of the mit- | ||
62 | compatible packages below. (note that the first two timings above | ||
63 | are highly variable because of cache effects). | ||
64 | |||
65 | kerberos des replacement from australia (version 1.95): | ||
66 | 53us per encryption (uses 2k of tables) | ||
67 | 96us to set a new key (uses 2.25k of key tables) | ||
68 | so despite the author's inclusion of some of the performance | ||
69 | improvements i had suggested to him, this package's | ||
70 | encryption/decryption is still slower on the sparc and 68000. | ||
71 | more specifically, 19-40% slower on the 68020 and 11-35% slower | ||
72 | on the sparc, depending on the compiler; | ||
73 | in full gory detail (ALT_ECB is a libdes variant): | ||
74 | compiler machine desCore libdes ALT_ECB slower by | ||
75 | gcc 2.1 -O2 Sun 3/110 304 uS 369.5uS 461.8uS 22% | ||
76 | cc -O1 Sun 3/110 336 uS 436.6uS 399.3uS 19% | ||
77 | cc -O2 Sun 3/110 360 uS 532.4uS 505.1uS 40% | ||
78 | cc -O4 Sun 3/110 365 uS 532.3uS 505.3uS 38% | ||
79 | gcc 2.1 -O2 Sun 4/50 48 uS 53.4uS 57.5uS 11% | ||
80 | cc -O2 Sun 4/50 48 uS 64.6uS 64.7uS 35% | ||
81 | cc -O4 Sun 4/50 48 uS 64.7uS 64.9uS 35% | ||
82 | (my time measurements are not as accurate as his). | ||
83 | the comments in my first release of desCore on version 1.92: | ||
84 | 68us per encryption (uses 2k of tables) | ||
85 | 96us to set a new key (uses 2.25k of key tables) | ||
86 | this is a very nice package which implements the most important | ||
87 | of the optimizations which i did in my encryption routines. | ||
88 | it's a bit weak on common low-level optimizations which is why | ||
89 | it's 39%-106% slower. because he was interested in fast crypt(3) and | ||
90 | password-cracking applications, he also used the same ideas to | ||
91 | speed up the key-setting routines with impressive results. | ||
92 | (at some point i may do the same in my package). he also implements | ||
93 | the rest of the mit des library. | ||
94 | (code from eay@psych.psy.uq.oz.au via comp.sources.misc) | ||
95 | |||
96 | fast crypt(3) package from denmark: | ||
97 | the des routine here is buried inside a loop to do the | ||
98 | crypt function and i didn't feel like ripping it out and measuring | ||
99 | performance. his code takes 26 sparc instructions to compute one | ||
100 | des iteration; above, Quick (64k) takes 21 and Small (2k) takes 37. | ||
101 | he claims to use 280k of tables but the iteration calculation seems | ||
102 | to use only 128k. his tables and code are machine independent. | ||
103 | (code from glad@daimi.aau.dk via alt.sources or comp.sources.misc) | ||
104 | |||
105 | swedish reimplementation of Kerberos des library | ||
106 | 108us per encryption (uses 34k worth of tables) | ||
107 | 134us to set a new key (uses 32k of key tables to get this speed!) | ||
108 | the tables used seem to be machine-independent; | ||
109 | he seems to have included a lot of special case code | ||
110 | so that, e.g., `long' loads can be used instead of 4 `char' loads | ||
111 | when the machine's architecture allows it. | ||
112 | (code obtained from chalmers.se:pub/des) | ||
113 | |||
114 | crack 3.3c package from england: | ||
115 | as in crypt above, the des routine is buried in a loop. it's | ||
116 | also very modified for crypt. his iteration code uses 16k | ||
117 | of tables and appears to be slow. | ||
118 | (code obtained from aem@aber.ac.uk via alt.sources or comp.sources.misc) | ||
119 | |||
120 | ``highly optimized'' and tweaked Kerberos/Athena code (byte-order dependent): | ||
121 | 165us per encryption (uses 6k worth of tables) | ||
122 | 478us to set a new key (uses <1k of key tables) | ||
123 | so despite the comments in this code, it was possible to get | ||
124 | faster code AND smaller tables, as well as making the tables | ||
125 | machine-independent. | ||
126 | (code obtained from prep.ai.mit.edu) | ||
127 | |||
128 | UC Berkeley code (depends on machine-endedness): | ||
129 | 226us per encryption | ||
130 | 10848us to set a new key | ||
131 | table sizes are unclear, but they don't look very small | ||
132 | (code obtained from wuarchive.wustl.edu) | ||
133 | |||
134 | |||
135 | motivation and history | ||
136 | |||
137 | a while ago i wanted some des routines and the routines documented on sun's | ||
138 | man pages either didn't exist or dumped core. i had heard of kerberos, | ||
139 | and knew that it used des, so i figured i'd use its routines. but once | ||
140 | i got it and looked at the code, it really set off a lot of pet peeves - | ||
141 | it was too convoluted, the code had been written without taking | ||
142 | advantage of the regular structure of operations such as IP, E, and FP | ||
143 | (i.e. the author didn't sit down and think before coding), | ||
144 | it was excessively slow, the author had attempted to clarify the code | ||
145 | by adding MORE statements to make the data movement more `consistent' | ||
146 | instead of simplifying his implementation and cutting down on all data | ||
147 | movement (in particular, his use of L1, R1, L2, R2), and it was full of | ||
148 | idiotic `tweaks' for particular machines which failed to deliver significant | ||
149 | speedups but which did obfuscate everything. so i took the test data | ||
150 | from his verification program and rewrote everything else. | ||
151 | |||
152 | a while later i ran across the great crypt(3) package mentioned above. | ||
153 | the fact that this guy was computing 2 sboxes per table lookup rather | ||
154 | than one (and using a MUCH larger table in the process) emboldened me to | ||
155 | do the same - it was a trivial change from which i had been scared away | ||
156 | by the larger table size. in his case he didn't realize you don't need to keep | ||
157 | the working data in TWO forms, one for easy use of half the sboxes in | ||
158 | indexing, the other for easy use of the other half; instead you can keep | ||
159 | it in the form for the first half and use a simple rotate to get the other | ||
160 | half. this means i have (almost) half the data manipulation and half | ||
161 | the table size. in fairness though he might be encoding something particular | ||
162 | to crypt(3) in his tables - i didn't check. | ||
163 | |||
164 | i'm glad that i implemented it the way i did, because this C version is | ||
165 | portable (the ifdef's are performance enhancements) and it is faster | ||
166 | than versions hand-written in assembly for the sparc! | ||
167 | |||
168 | |||
169 | porting notes | ||
170 | |||
171 | one thing i did not want to do was write an enormous mess | ||
172 | which depended on endedness and other machine quirks, | ||
173 | and which necessarily produced different code and different lookup tables | ||
174 | for different machines. see the kerberos code for an example | ||
175 | of what i didn't want to do; all their endedness-specific `optimizations' | ||
176 | obfuscate the code and in the end were slower than a simpler machine | ||
177 | independent approach. however, there are always some portability | ||
178 | considerations of some kind, and i have included some options | ||
179 | for varying numbers of register variables. | ||
180 | perhaps some will still regard the result as a mess! | ||
181 | |||
182 | 1) i assume everything is byte addressable, although i don't actually | ||
183 | depend on the byte order, and that bytes are 8 bits. | ||
184 | i assume word pointers can be freely cast to and from char pointers. | ||
185 | note that 99% of C programs make these assumptions. | ||
186 | i always use unsigned char's if the high bit could be set. | ||
187 | 2) the typedef `word' means a 32 bit unsigned integral type. | ||
188 | if `unsigned long' is not 32 bits, change the typedef in desCore.h. | ||
189 | i assume sizeof(word) == 4 EVERYWHERE. | ||
190 | |||
191 | the (worst-case) cost of my NOT doing endedness-specific optimizations | ||
192 | in the data loading and storing code surrounding the key iterations | ||
193 | is less than 12%. also, there is the added benefit that | ||
194 | the input and output work areas do not need to be word-aligned. | ||
195 | |||
196 | |||
197 | OPTIONAL performance optimizations | ||
198 | |||
199 | 1) you should define one of `i386,' `vax,' `mc68000,' or `sparc,' | ||
200 | whichever one is closest to the capabilities of your machine. | ||
201 | see the start of desCode.h to see exactly what this selection implies. | ||
202 | note that if you select the wrong one, the des code will still work; | ||
203 | these are just performance tweaks. | ||
204 | 2) for those with functional `asm' keywords: you should change the | ||
205 | ROR and ROL macros to use machine rotate instructions if you have them. | ||
206 | this will save 2 instructions and a temporary per use, | ||
207 | or about 32 to 40 instructions per en/decryption. | ||
208 | note that gcc is smart enough to translate the ROL/R macros into | ||
209 | machine rotates! | ||
210 | |||
211 | these optimizations are all rather persnickety, yet with them you should | ||
212 | be able to get performance equal to assembly-coding, except that: | ||
213 | 1) with the lack of a bit rotate operator in C, rotates have to be synthesized | ||
214 | from shifts. so access to `asm' will speed things up if your machine | ||
215 | has rotates, as explained above in (3) (not necessary if you use gcc). | ||
216 | 2) if your machine has less than 12 32-bit registers i doubt your compiler will | ||
217 | generate good code. | ||
218 | `i386' tries to configure the code for a 386 by only declaring 3 registers | ||
219 | (it appears that gcc can use ebx, esi and edi to hold register variables). | ||
220 | however, if you like assembly coding, the 386 does have 7 32-bit registers, | ||
221 | and if you use ALL of them, use `scaled by 8' address modes with displacement | ||
222 | and other tricks, you can get reasonable routines for DesQuickCore... with | ||
223 | about 250 instructions apiece. For DesSmall... it will help to rearrange | ||
224 | des_keymap, i.e., now the sbox # is the high part of the index and | ||
225 | the 6 bits of data is the low part; it helps to exchange these. | ||
226 | since i have no way to conveniently test it i have not provided my | ||
227 | shoehorned 386 version. note that with this release of desCore, gcc is able | ||
228 | to put everything in registers(!), and generate about 370 instructions apiece | ||
229 | for the DesQuickCore... routines! | ||
230 | |||
231 | coding notes | ||
232 | |||
233 | the en/decryption routines each use 6 necessary register variables, | ||
234 | with 4 being actively used at once during the inner iterations. | ||
235 | if you don't have 4 register variables get a new machine. | ||
236 | up to 8 more registers are used to hold constants in some configurations. | ||
237 | |||
238 | i assume that the use of a constant is more expensive than using a register: | ||
239 | a) additionally, i have tried to put the larger constants in registers. | ||
240 | registering priority was by the following: | ||
241 | anything more than 12 bits (bad for RISC and CISC) | ||
242 | greater than 127 in value (can't use movq or byte immediate on CISC) | ||
243 | 9-127 (may not be able to use CISC shift immediate or add/sub quick), | ||
244 | 1-8 were never registered, being the cheapest constants. | ||
245 | b) the compiler may be too stupid to realize table and table+256 should | ||
246 | be assigned to different constant registers and instead repetitively | ||
247 | do the arithmetic, so i assign these to explicit `m' register variables | ||
248 | when possible and helpful. | ||
249 | |||
250 | i assume that indexing is cheaper or equivalent to auto increment/decrement, | ||
251 | where the index is 7 bits unsigned or smaller. | ||
252 | this assumption is reversed for 68k and vax. | ||
253 | |||
254 | i assume that addresses can be cheaply formed from two registers, | ||
255 | or from a register and a small constant. | ||
256 | for the 68000, the `two registers and small offset' form is used sparingly. | ||
257 | all index scaling is done explicitly - no hidden shifts by log2(sizeof). | ||
258 | |||
259 | the code is written so that even a dumb compiler | ||
260 | should never need more than one hidden temporary, | ||
261 | increasing the chance that everything will fit in the registers. | ||
262 | KEEP THIS MORE SUBTLE POINT IN MIND IF YOU REWRITE ANYTHING. | ||
263 | (actually, there are some code fragments now which do require two temps, | ||
264 | but fixing it would either break the structure of the macros or | ||
265 | require declaring another temporary). | ||
266 | |||
267 | |||
268 | special efficient data format | ||
269 | |||
270 | bits are manipulated in this arrangement most of the time (S7 S5 S3 S1): | ||
271 | 003130292827xxxx242322212019xxxx161514131211xxxx080706050403xxxx | ||
272 | (the x bits are still there, i'm just emphasizing where the S boxes are). | ||
273 | bits are rotated left 4 when computing S6 S4 S2 S0: | ||
274 | 282726252423xxxx201918171615xxxx121110090807xxxx040302010031xxxx | ||
275 | the rightmost two bits are usually cleared so the lower byte can be used | ||
276 | as an index into an sbox mapping table. the next two x'd bits are set | ||
277 | to various values to access different parts of the tables. | ||
278 | |||
279 | |||
280 | how to use the routines | ||
281 | |||
282 | datatypes: | ||
283 | pointer to 8 byte area of type DesData | ||
284 | used to hold keys and input/output blocks to des. | ||
285 | |||
286 | pointer to 128 byte area of type DesKeys | ||
287 | used to hold full 768-bit key. | ||
288 | must be long-aligned. | ||
289 | |||
290 | DesQuickInit() | ||
291 | call this before using any other routine with `Quick' in its name. | ||
292 | it generates the special 64k table these routines need. | ||
293 | DesQuickDone() | ||
294 | frees this table | ||
295 | |||
296 | DesMethod(m, k) | ||
297 | m points to a 128byte block, k points to an 8 byte des key | ||
298 | which must have odd parity (or -1 is returned) and which must | ||
299 | not be a (semi-)weak key (or -2 is returned). | ||
300 | normally DesMethod() returns 0. | ||
301 | m is filled in from k so that when one of the routines below | ||
302 | is called with m, the routine will act like standard des | ||
303 | en/decryption with the key k. if you use DesMethod, | ||
304 | you supply a standard 56bit key; however, if you fill in | ||
305 | m yourself, you will get a 768bit key - but then it won't | ||
306 | be standard. it's 768bits not 1024 because the least significant | ||
307 | two bits of each byte are not used. note that these two bits | ||
308 | will be set to magic constants which speed up the encryption/decryption | ||
309 | on some machines. and yes, each byte controls | ||
310 | a specific sbox during a specific iteration. | ||
311 | you really shouldn't use the 768bit format directly; i should | ||
312 | provide a routine that converts 128 6-bit bytes (specified in | ||
313 | S-box mapping order or something) into the right format for you. | ||
314 | this would entail some byte concatenation and rotation. | ||
315 | |||
316 | Des{Small|Quick}{Fips|Core}{Encrypt|Decrypt}(d, m, s) | ||
317 | performs des on the 8 bytes at s into the 8 bytes at d. (d,s: char *). | ||
318 | uses m as a 768bit key as explained above. | ||
319 | the Encrypt|Decrypt choice is obvious. | ||
320 | Fips|Core determines whether a completely standard FIPS initial | ||
321 | and final permutation is done; if not, then the data is loaded | ||
322 | and stored in a nonstandard bit order (FIPS w/o IP/FP). | ||
323 | Fips slows down Quick by 10%, Small by 9%. | ||
324 | Small|Quick determines whether you use the normal routine | ||
325 | or the crazy quick one which gobbles up 64k more of memory. | ||
326 | Small is 50% slower then Quick, but Quick needs 32 times as much | ||
327 | memory. Quick is included for programs that do nothing but DES, | ||
328 | e.g., encryption filters, etc. | ||
329 | |||
330 | |||
331 | Getting it to compile on your machine | ||
332 | |||
333 | there are no machine-dependencies in the code (see porting), | ||
334 | except perhaps the `now()' macro in desTest.c. | ||
335 | ALL generated tables are machine independent. | ||
336 | you should edit the Makefile with the appropriate optimization flags | ||
337 | for your compiler (MAX optimization). | ||
338 | |||
339 | |||
340 | Speeding up kerberos (and/or its des library) | ||
341 | |||
342 | note that i have included a kerberos-compatible interface in desUtil.c | ||
343 | through the functions des_key_sched() and des_ecb_encrypt(). | ||
344 | to use these with kerberos or kerberos-compatible code put desCore.a | ||
345 | ahead of the kerberos-compatible library on your linker's command line. | ||
346 | you should not need to #include desCore.h; just include the header | ||
347 | file provided with the kerberos library. | ||
348 | |||
349 | Other uses | ||
350 | |||
351 | the macros in desCode.h would be very useful for putting inline des | ||
352 | functions in more complicated encryption routines. | ||