+===========================================================+
|   Introduction to the lossless compression schemes        |
|   Description of the codec source codes                   |
+-----------------------------------------------------------+
| From David Bourgin (E-mail: david.bourgin@ufrima.imag.fr) |
| Date: 22/9/94                                             |
+===========================================================+

 ------ BE CAREFUL ------
This file (compress.txt) is copyrighted. (c) David Bourgin - 1994
Permission to use this documentation for any purpose other than
its incorporation into a commercial product is hereby granted without fee.
Permission to copy and distribute this documentation only for non-commercial
use is also granted without fee, provided, however, that the above copyright
notice appears in all copies and that both that copyright notice and this
permission notice appear in supporting documentation. The author makes no
representations about the suitability of this documentation for any purpose.
It is provided "as is" without express or implied warranty.

The source codes you obtain with this file are *NOT* covered by the same
copyright, because you may include them in both commercial and non-commercial
applications. See below for more information.

The source code files (codrle1.c, dcodrle1.c, codrle2.c, dcodrle2.c, codrle3.c,
dcodrle3.c, codrle4.c, dcodrle4.c, codhuff.c, dcodhuff.c) are copyrighted.
They were uploaded by ftp to turing.imag.fr (129.88.31.7):/pub/compression
on 22/5/94 and were modified on 22/9/94.
(c) David Bourgin - 1994
The source codes I provide have no bugs (!), but since I make them available
for free I have a few notes to make. They can change at any time without
notice. I assume no responsibility or liability for any errors or
inaccuracies, make no warranty of any kind (express, implied or statutory)
with respect to this publication and expressly disclaim any and all warranties
of merchantability and fitness for particular purposes. Of course, if you have
problems using the information presented here, I will try to help you if I
can.

If you include the source codes in your application, here are the conditions:
- You have to put my name in the header of your source file (not necessarily
in the executable program) (this item is a must)
- I would like to see your resulting application, if possible (this item is
not a must, because some applications must remain secret)
- Whenever you earn money with your application, I would like to receive a
very small part in order to be encouraged to update my source codes and to
develop new schemes (this item is not a must)
 ---------------------

There are several ways to compress data. Here, we are only going to deal with
the lossless schemes. These schemes are also called non-destructive because
you always recover the initial data you had, as soon as you need them.
With lossless schemes, you will never lose any information (except perhaps
when you store or transmit your data, but that is another problem...).

In this introduction, we are going to see:
- The RLE scheme (with different possible algorithms)
- The Huffman scheme (dynamic scheme)
- And the LZW scheme

For the novice, a compressor is a program able to read several data (e.g.
bytes) as input and to write several data as output. The data you obtain from
the output (also called compressed data) will - of course - take less space
than the input data. This is true in most cases, if the compressor works and
if the type of the data is suitable for compression with the given scheme.
The codec (coder-decoder) enables you to save space on your hard disk and/or
to save communication costs because you always store/transmit the compressed
data. You'll use the decompressor as soon as you need to recover your initial
useful data. Note that the compressed data are useless if you don't have the
decoder...

You are doubtless asking "How can I reduce the data size without losing some
information?". It's easy to answer this question. I'll just take an example.
I'm sure you have heard about Morse code. This system, established in the
19th century, uses a scheme very close to the Huffman one. In Morse you encode
the letters to transmit with two kinds of signs. If you encode these two sign
possibilities in one bit, the symbol 'e' is transmitted in a single bit while
the symbols 'y' and 'z' need four bits. Look at the symbols in the text you
are reading, you'll quickly understand the compression ratio...

Important: The source codes associated with the algorithms I present are
completely adaptable to what you need to compress. They all use basic macros
at the top of the file. Usually the macros to change are:

- beginning_of_data
- end_of_data
- read_byte
- read_block
- write_byte
- write_block

These allow the programmer to modify only a small part of the header of the
source codes in order to compress memory as well as files.

beginning_of_data(): Macro used to set the program up so that the next
read_byte() call will read the first byte to compress.
end_of_data(): Returns a boolean telling whether there are any more bytes to
read from the input stream. It returns 0 (false) while there are still bytes
to compress, and a non-zero value once the input is exhausted.
read_byte(): Returns a byte read from the input stream if available.
write_byte(x): Writes the byte 'x' to the output stream.
read_block(...) and write_block(...): Same use as read_byte and write_byte(x),
but these macros work on blocks of bytes and not only on a single byte.

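For a plain file-to-file codec, these macros can be mapped directly onto
stdio. Here is a minimal sketch, assuming the FILE pointers are called
source_file and dest_file as in the main() functions mentioned below; the
exact definitions used in the distributed source files may differ:

#include <stdio.h>

FILE *source_file,      /* File to compress, opened in "rb" mode  */
     *dest_file;        /* Compressed output, opened in "wb" mode */

/* Rewind the input so that the next read_byte() returns the first byte */
#define beginning_of_data()  (fseek(source_file,0L,SEEK_SET))
/* Non-zero once a read has hit the end of the input file */
#define end_of_data()        (feof(source_file))
/* Read/write a single byte */
#define read_byte()          ((unsigned char)getc(source_file))
#define write_byte(x)        (putc((x),dest_file))
/* Read/write a block of 'n' bytes at address 'ptr' */
#define read_block(ptr,n)    (fread((ptr),1,(n),source_file))
#define write_block(ptr,n)   (fwrite((ptr),1,(n),dest_file))
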
If you want to compress *from* memory, before entering an xxxcoding procedure
('xxx' is the actual extension to replace with a given codec), you have to add
a pointer set up to the beginning of the zone to compress. Note that the
following pointer 'source_memory_base' is not to be added, it is just given
here to name the address of the memory zone you are going to encode or decode.
The same goes for source_memory_end, which can be either a pointer to create
or an existing pointer.

unsigned char *source_memory_base, /* Base of the source memory */
              *source_memory_end,  /* Last address to read.
                 source_memory_end=source_memory_base+source_zone_length-1 */
              *source_ptr;         /* Used in the xxxcoding procedure */
void pre_start()
{ source_ptr=source_memory_base;
  xxxcoding();
}

end_of_data() and read_byte() must also be modified to compress *from* memory:

#define end_of_data() (source_ptr>source_memory_end)
#define read_byte()   (*(source_ptr++))

If you want to compress *to* memory, before entering an xxxcoding procedure
('xxx' is the actual extension to replace with a given codec), you have to add
a pointer. Note that the pointer 'dest_memory_base' is not to be added, it is
just given there to name the address of the destination memory zone you are
going to encode or decode.

unsigned char *dest_memory_base, /* Base of the destination memory */
              *dest_ptr;         /* Used in the xxxcoding procedure */
void pre_start()
{ dest_ptr=dest_memory_base;
  xxxcoding();
}

Of course, you can combine both from and to memory in the pre_start()
procedure. The files dest_file and source_file handled in the main() function
are then to be removed...

void pre_start()
{ source_ptr=source_memory_base;
  dest_ptr=dest_memory_base;
  xxxcoding();
}

In fact, to write to memory, the problem is in the write_byte(x) procedure.
This problem exists because your destination zone can be either a static zone
or a dynamically allocated zone. In both cases, you have to check that there
is no overflow, especially if the coder is not efficient and produces more
bytes than you reserved in memory.

In the first case, with a *static* zone, the write_byte(x) macro should look
like this:

unsigned long int dest_zone_length,
                  current_size;

#define write_byte(x) { if (current_size==dest_zone_length) \
                           exit(1); \
                        dest_ptr[current_size++]=(unsigned char)(x); \
                      }

In the static version, the pre_start() procedure is to be modified as follows:

void pre_start()
{ source_ptr=source_memory_base;
  dest_ptr=dest_memory_base;
  dest_zone_length=...; /* Set up to the actual destination zone length */
  current_size=0;       /* Number of written bytes */
  xxxcoding();
}

Otherwise, dest_ptr is a zone created by the malloc instruction and you can
try to resize the allocated zone with the realloc instruction. Note that I
grow the zone one kilobyte at a time. You have to add two other variables:

unsigned long int dest_zone_length,
                  current_size;

#define write_byte(x) { if (current_size==dest_zone_length) \
                           { dest_zone_length += 1024; \
                             if ((dest_ptr=(unsigned char *)realloc(dest_ptr,dest_zone_length*sizeof(unsigned char)))==NULL) \
                                exit(1); /* You can't compress in memory \
                                   => I exit, but *you* can write a routine to swap to disk */ \
                           } \
                        dest_ptr[current_size++]=(unsigned char)(x); \
                      }

With the dynamically allocated version, change the pre_start() routine as
follows:

void pre_start()
{ source_ptr=source_memory_base;
  dest_zone_length=1024;
  if ((dest_ptr=(unsigned char *)malloc(dest_zone_length*sizeof(unsigned char)))==NULL)
     exit(1); /* You need at least 1 kb of dynamic memory! */
  current_size=0; /* Number of written bytes */
  xxxcoding();
  /* Handle the bytes in dest_ptr but don't forget to free them with:
     free(dest_ptr);
  */
}

The previously given macros are used like this:

void demo() /* The file opening, closing and variables
               must be set up by the calling procedure */
{ unsigned char byte;
  /* And not 'char byte' (!) */
  while (!end_of_data())
        { byte=read_byte();
          printf("Byte read=%c\n",byte);
        }
}

You must not change the rest of the program unless you're really sure and
really need to do it!

+==========================================================+
|                     The RLE encoding                     |
+==========================================================+

RLE is an acronym that stands for Run Length Encoding. You may encounter it
under another acronym: RLC, Run Length Coding.

The idea in this scheme is to recode your data with regard to the repetition
of frames. A frame is one or more bytes that occur one or several times.

There are several ways to encode occurrences, so you'll have several codecs.
For example, you may have a sequence such as:
0,0,0,0,0,0,255,255,255,2,3,4,2,3,4,5,8,11

Some codecs will only deal with the repetitions of '0' and '255', but some
others will also deal with the repetitions of '0', '255', and '2,3,4'.

You have to keep in mind something important based on this example. A codec
won't work on all the data you will try to compress. So, when there are no
repeated sequences, the codecs based on RLE schemes must not display a message
saying "Bye bye". Actually, they will encode these non-repeated data with a
value that says "Sorry, I only make a copy of the initial input". Of course, a
copy of the input data with a header in front of this copy makes the output
bigger, but if you consider the whole data to compress, the encoding of
repeated frames will take less space than the encoding of non-repeated frames.

All of the algorithms with the name of RLE have the following look, with three
or four values:
- A value saying whether there is a repetition
- A value saying how many repetitions (or non-repetitions) there are
- A value giving the length of the frame (useless if you only encode frames
with one byte as maximum length)
- The value of the frame to repeat (or not)

I give four algorithms to illustrate this.

*** First RLE scheme ***

The first scheme is the simplest I know, and looks like the one used in the
Mac system (MacPackBit) and in some image file formats such as Targa, PCX,
TIFF, ...

Here, all compressed blocks begin with a byte, named the header, whose layout
is:

Bits   7 6 5 4 3 2 1 0
Header X X X X X X X X

Bit  7:      Compression status (1=Compression applied)
Bits 0 to 6: Number of bytes to handle

So, if bit 7 is set to 0, bits 0 to 6 give the number of bytes that follow
(minus 1, to gain a little more compression) and that were not compressed
(native bytes). If bit 7 is set to 1, the same bits 0 to 6 give the number of
repetitions (minus 2) of the following byte.

As you see, this method only handles frames of one byte.

Additional note: You have 'minus 1' for non-repeated frames because you must
have at least one byte to compress, and 'minus 2' for repeated frames because
the repetition must be at least 2.

Compression scheme:

                 First byte=Next byte?
                          /\
                         /  \
                        /    \
  Count the byte              Count the occurrence of NON identical
  occurrences                 bytes (maximum 128 times)
  (maximum 129 times)         and store them in an array
        |                            |
        |                            |
    1 bit '1'                    1 bit '0'
  + 7 bits giving              + 7 bits giving
    the number (-2)              the number (-1)
    of repetitions               of non-repetitions
  + repeated byte              + n non-repeated bytes
        |                            |
  1xxxxxxx,yyyyyyyy             0xxxxxxx,n bytes
 [-----------------]           [----------------]

Example:

Sequence of bytes to encode |  Coded values  | Difference with compression
                            |                |        (unit: byte)
-------------------------------------------------------------------------
 255,15,                    | 1,255,15,      | +1
 255,255,                   | 128,255,       |  0
 15,15,                     | 128,15,        |  0
 255,255,255,               | 129,255,       | -1
 15,15,15,                  | 129,15,        | -1
 255,255,255,255,           | 130,255,       | -2
 15,15,15,15                | 130,15         | -2

See codec source codes: codrle1.c and dcodrle1.c

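To make the header rule concrete, here is a minimal sketch of a decoder for
this first scheme, written with the end_of_data()/read_byte()/write_byte()
macros described at the beginning of this file. It only illustrates the rule
given above and is not a copy of dcodrle1.c:

/* Sketch of a decoder for the first RLE scheme:
   header bit 7 = 1 -> (count-2) stored, one repeated byte follows,
   header bit 7 = 0 -> (count-1) stored, that many literal bytes follow. */
void rle1_decode(void)
{ unsigned char header,byte;
  unsigned int count;

  while (!end_of_data())
        { header=read_byte();
          if (header & 0x80)                 /* Repetition block */
             { count=(header & 0x7F)+2;      /* 2 to 129 repetitions */
               byte=read_byte();             /* Byte to repeat */
               while (count--)
                     write_byte(byte);
             }
          else                               /* Native (uncompressed) block */
             { count=(header & 0x7F)+1;      /* 1 to 128 literal bytes */
               while (count--)
                     write_byte(read_byte());
             }
        }
}
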
*** Second RLE scheme ***

In the second scheme of RLE compression you look for the least frequent byte
in the source to compress and use it as a header for all compressed blocks.

In the best case, the occurrence of this byte is zero in the data to compress.

There are two possible schemes: one handling frames of only one byte, the
other handling frames of one byte *and* more. The first case is the subject of
this current compression scheme, the second is the subject of the next
compression scheme.

For frames of one byte, the header byte is written in front of every
repetition of at least 4 bytes. It is then followed by the repetition number
minus 1 and the repeated byte:
Header byte, Occurrence number-1, repeated byte

If a byte doesn't repeat more than three times, the bytes are written without
changes to the destination stream (no header, nor length, nor repetition in
front of or after these bytes).

An exception: if the header byte itself appears in the source once, twice,
three times or more, it'll be respectively encoded as follows:
- Header byte, 0
- Header byte, 1
- Header byte, 2
- Header byte, Occurrence number-1, Header byte

Example, let's take the previous example. A non-frequent byte is the zero
ASCII code because it never appears.

Sequence of bytes to encode |  Coded values  | Difference with compression
                            |                |        (unit: byte)
-------------------------------------------------------------------------
 255,15,                    | 255,15,        |  0
 255,255,                   | 255,255,       |  0
 15,15,                     | 15,15,         |  0
 255,255,255,               | 255,255,255,   |  0
 15,15,15,                  | 15,15,15,      |  0
 255,255,255,255,           | 0,3,255,       | -1
 15,15,15,15                | 0,3,15         | -1

If the header byte were to appear, we would see:

Sequence of bytes to encode |  Coded values  | Difference with compression
                            |                |        (unit: byte)
-------------------------------------------------------------------------
 0,                         | 0,0,           | +1
 255,                       | 255,           |  0
 0,0,                       | 0,1,           |  0
 15,                        | 15,            |  0
 0,0,0,                     | 0,2,           | -1
 255,                       | 255,           |  0
 0,0,0,0                    | 0,3,0          | -1

See codec source codes: codrle2.c and dcodrle2.c

*** Third RLE scheme ***

It's the same idea as the second scheme, but we can encode frames of more than
one byte. So we have three cases:

- If it is the header byte, whatever its occurrence count, you encode it with:
Header byte, 0, Number of occurrences-1
- For frames where (repetitions-1)*length>3, you encode:
Header byte, Number of frame repetitions-1, Frame length-1, bytes of the frame
- If no previous case was detected, you write the bytes as they are (no
header, nor length, nor repetition in front of or after these bytes).

Example based on the previous examples:

Sequence of bytes to encode |   Coded values    | Difference with compression
                            |                   |        (unit: byte)
-----------------------------------------------------------------------------
 255,15,                    | 255,15,           |  0
 255,255,                   | 255,255,          |  0
 15,15,                     | 15,15,            |  0
 255,255,255,               | 255,255,255,      |  0
 15,15,15,                  | 15,15,15,         |  0
 255,255,255,255,           | 255,255,255,255,  |  0
 15,15,15,15,               | 15,15,15,15,      |  0
 16,17,18,16,17,18,         | 16,17,18,16,17,18,|  0
 255,255,255,255,255,       | 0,4,0,255,        | -1
 15,15,15,15,15,            | 0,4,0,15,         | -1
 16,17,18,16,17,18,16,17,18,| 0,2,2,16,17,18,   | -3
 16,17,18,19,16,17,18,19    | 0,1,3,16,17,18,19 | -1

If the header byte (value 0) were to be met, we would see:

Sequence of bytes to encode |  Coded values  | Difference with compression
                            |                |        (unit: byte)
--------------------------------------------------------------------------
 0,                         | 0,0,0,         | +2
 255,                       | 255,           |  0
 0,0,                       | 0,0,1,         | +1
 15,                        | 15,            |  0
 0,0,0,                     | 0,0,2,         |  0
 255,                       | 255,           |  0
 0,0,0,0                    | 0,0,3          | -1

See codec source codes: codrle3.c and dcodrle3.c

*** Fourth RLE scheme ***

This last RLE algorithm handles repetitions of any kind (one byte and more)
and non-repetitions better, including short runs of non-repeated bytes, and
does not read the source twice as RLE type 3 does.

The compression scheme is:

First byte = Next byte?

- Yes: emit 1 bit '0', then count the occurrences of the repeated byte
  (maximum 16449 times):
  - Fewer than 66 repetitions: emit 1 bit '0' + 6 bits giving the length (-2)
    of the run + the repeated byte
    => 00xxxxxx,1 byte
  - 66 repetitions or more: emit 1 bit '1' + 14 bits giving the length (-66)
    of the run + the repeated byte
    => 01xxxxxx,xxxxxxxx,1 byte

- No: emit 1 bit '1', then look for a motif (a frame of up to 65 bytes,
  repeated up to 257 times):
  - A motif repeats: emit 1 bit '0' + 6 bits of the length (-2) of the motif
    + 8 bits of the number (-2) of repetitions + the bytes of the motif
    => 10xxxxxx,yyyyyyyy,n bytes
  - No motif, i.e. non-repeated bytes (maximum 8224): emit 1 bit '1', then:
    - Fewer than 33 non-repeated bytes: emit 1 bit '0' + 5 bits of the
      number (-1) of non-repetitions + the non-repeated bytes
      => 110xxxxx,n bytes
    - 33 non-repeated bytes or more: emit 1 bit '1' + 13 bits of the
      number (-33) of non-repetitions + the non-repeated bytes
      => 111xxxxx,xxxxxxxx,n bytes

Example, same as previously:

Sequence of bytes to encode |      Coded values       | Difference with
                            |                         | compression (byte)
--------------------------------------------------------------------------
 255,15                     | 11000001b,255,15,       | +1
 255,255                    | 00000000b,255,          |  0
 15,15                      | 00000000b,15,           |  0
 255,255,255                | 00000001b,255,          | -1
 15,15,15                   | 00000001b,15,           | -1
 255,255,255,255            | 00000010b,255,          | -2
 15,15,15,15                | 00000010b,15,           | -2
 16,17,18,16,17,18          | 10000001b,0,16,17,18,   | -1
 255,255,255,255,255        | 00000011b,255,          | -3
 15,15,15,15,15             | 00000011b,15,           | -3
 16,17,18,16,17,18,16,17,18 | 10000001b,1,16,17,18,   | -4
 16,17,18,19,16,17,18,19    | 10000010b,0,16,17,18,19 | -2

See codec source codes: codrle4.c and dcodrle4.c

+==========================================================+
|                   The Huffman encoding                   |
+==========================================================+

This method is named after the researcher (D. A. Huffman) who established the
algorithm in 1952. This method allows both dynamic and static statistical
schemes. A statistical scheme works on the occurrences of the data. It is not
as with RLE, where you consider the current occurrence of a frame, but rather
a consideration of the global occurrences of each datum in the input stream.
In this last case, frames can be any kind of sequence you want. On the other
hand, static Huffman encoding appears in some compressors such as ARJ on PCs.
It forces the encoder to consider every statistic as the same for all the data
you have. Of course, the results are not as good as with dynamic encoding.
Static encoding is faster than dynamic encoding, but dynamic encoding adapts
to the statistics of the bytes of the input stream and of course becomes more
efficient by producing shorter output.

The main idea in Huffman encoding is to re-code every byte with regard to its
occurrence count. The more frequent bytes in the data to compress will be
encoded with fewer than 8 bits, and the others may need 8 bits or even more to
be encoded. You immediately see that the codes associated with the different
bytes won't have identical sizes. The Huffman method actually requires that
the binary codes do not have a fixed size. We then speak about variable length
codes.

The dynamic Huffman scheme needs binary trees for the encoding. This enables
you to obtain the best codes, adapted to the source data. The demonstration
won't be given here. To help the neophyte, I will just explain what a binary
tree is.

A binary tree is a special fashion to represent data. A binary tree is a
structure with an associated value and two pointers. The term binary has been
given because of the presence of the two pointers. By convention, one of the
pointers is called the left pointer and the second pointer is called the right
pointer. Here is a visual representation of a binary tree.

          Value
         /     \
        /       \
     Value     Value
     /   \     /   \
   ...   ... ...   ...

One problem with a binary encoding is the prefix problem. A prefix is the
first part of the representation of a value, e.g. "h" and "he" are prefixes of
"hello" but "el" is not. To understand the problem, let's code the letters
"A", "B", "C", "D", and "E" respectively as 00b, 01b, 10b, 11b, and 100b. When
you read the binary sequence 00100100b, you are unable to say whether it comes
from "ACBA" or "AEE". To avoid such situations, the codes must have the prefix
property: no code may begin with the sequence of another code. With "A", "B",
"C", "D", and "E" respectively assigned 1b, 01b, 001b, 0001b, and 0000b, the
sequence 1001011b can only be decoded as "ACBA".

      1    0
   <- /\ ->
     /  \
   "A"  /\
       /  \
     "B"  /\
         /  \
       "C"  /\
           /  \
         "D"  "E"

As you see, with this tree, an encoding will have the prefix property if the
bytes are at the end of each "branch" and you have no byte at a "node". You
also see that if you reach a character through the right pointer you add a bit
set to 0, and through the left pointer you add a bit set to 1 to the current
code. The previous *bad* encoding provides the following bad tree:

           /\
          /  \
         /    \
       /\      /\
      /  \   "B" "A"
     /    \
   "D"   "C"
           /\
             \
             "E"

You see here that the coder shouldn't put the "C" at a node...

As you see, the longest binary codes are those with the longest distance from
the top of the tree. Finally, the more frequent bytes will be placed highest
in the tree so that they get the shortest encoding, and the less frequent
bytes will be lowest in the tree.

From an algorithmic point of view, you make a list of every byte you
encountered in the stream to compress. This list is always kept sorted. The
zero-occurrence bytes are removed from this list. You take the two bytes with
the smallest occurrence counts in the list. Whenever two bytes have the same
"weight", you take any two of them regardless of their ASCII value. You join
them in a node. This node will have a fictive byte value (256 is a good one!)
and its weight will be the sum of the weights of the two joined bytes. You
then replace the two joined bytes with the fictive byte, and you continue in
this way until you have only one byte (fictive or not) in the list. Of course,
this process will produce the shortest codes only if the list remains sorted.
I will not explain with arcane hard maths why the result is a set of the
shortest codes...

Important: I use as a convention that the right sub-tree has a weight greater
than or equal to the weight of the left sub-tree.

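To make the joining process concrete, here is a small sketch that builds such
a tree and prints the code length of every byte for the occurrence table of
the example that follows. The structure and function names are mine, not those
of codhuff.c; it simply repeats the rule "join the two lightest entries,
lighter one on the left":

#include <stdio.h>

struct node { unsigned long weight;
              int           byte;       /* 0..255 for real bytes, 256 for fictive nodes */
              int           left,right; /* Indexes of the sub-trees, -1 for leaves      */
            };

/* Builds the Huffman tree for 'n' leaves stored in nodes[0..n-1] and returns
   the index of the root. nodes[] must have room for 2*n-1 entries. */
int build_tree(struct node *nodes,int n)
{ int alive[512],count=n,i,best1,best2,lo,hi;

  for (i=0;i<n;i++) alive[i]=i;
  while (count>1)
        { /* Find the two live entries with the smallest weights */
          best1=(nodes[alive[0]].weight<=nodes[alive[1]].weight)?0:1;
          best2=1-best1;
          for (i=2;i<count;i++)
              if (nodes[alive[i]].weight<nodes[alive[best1]].weight)
                 { best2=best1; best1=i; }
              else if (nodes[alive[i]].weight<nodes[alive[best2]].weight)
                 best2=i;
          /* Join them under a new fictive node */
          nodes[n].weight=nodes[alive[best1]].weight+nodes[alive[best2]].weight;
          nodes[n].byte=256;
          nodes[n].left=alive[best1];   /* lightest on the left  */
          nodes[n].right=alive[best2];  /* heaviest on the right */
          /* Replace the two joined entries with the fictive one */
          lo=(best1<best2)?best1:best2;
          hi=(best1<best2)?best2:best1;
          alive[lo]=n;
          alive[hi]=alive[count-1];
          count--; n++;
        }
  return alive[0];
}

void print_code_lengths(struct node *nodes,int root,int depth)
{ if (nodes[root].left==-1)
     printf("byte %3d -> %d bits\n",nodes[root].byte,depth);
  else
     { print_code_lengths(nodes,nodes[root].left,depth+1);
       print_code_lengths(nodes,nodes[root].right,depth+1);
     }
}

int main(void)
{ static struct node nodes[2*7-1]=
    { {338,0,-1,-1},{300,255,-1,-1},{280,31,-1,-1},{24,77,-1,-1},
      {21,115,-1,-1},{20,83,-1,-1},{5,222,-1,-1} };
  print_code_lengths(nodes,build_tree(nodes,7),0);
  return 0;
}

Run on the seven occurrence counts below, it reports 2 bits for the bytes 0,
255 and 31, and 4 bits for 77, 115, 83 and 222, as in the worked example.
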
Example: Let's take a file to compress where we notice the following
occurrences:

Listed bytes | Frequencies (Weight)
-----------------------------------
      0      |   338
     255     |   300
      31     |   280
      77     |    24
     115     |    21
      83     |    20
     222     |     5

We will begin by joining the bytes 83 and 222. This will produce a fictive
node 1 with a weight of 20+5=25.

(Fictive 1,25)
      /\
     /  \
(222,5) (83,20)

Listed bytes | Frequencies (Weight)
-----------------------------------
      0      |   338
     255     |   300
      31     |   280
  Fictive 1  |    25
      77     |    24
     115     |    21

Note that the list is sorted... The smallest values among the frequencies are
21 and 24. That is why we take the bytes 77 and 115 to build the fictive
node 2.

(Fictive 2,45)
      /\
     /  \
(115,21) (77,24)

Listed bytes | Frequencies (Weight)
-----------------------------------
      0      |   338
     255     |   300
      31     |   280
  Fictive 2  |    45
  Fictive 1  |    25

The nodes with the smallest weights are the fictive 1 and 2 nodes. These are
joined to build the fictive node 3, whose weight is 45+25=70.

         (Fictive 3,70)
           /        \
          /          \
 (Fictive 1,25)  (Fictive 2,45)
    /     \         /     \
(222,5) (83,20) (115,21) (77,24)

Listed bytes | Frequencies (Weight)
-----------------------------------
      0      |   338
     255     |   300
      31     |   280
  Fictive 3  |    70

The fictive node 3 is linked to the byte 31. Total weight: 280+70=350.

            (Fictive 4,350)
              /         \
             /           \
     (Fictive 3,70)   (31,280)
        /        \
       /          \
(Fictive 1,25)  (Fictive 2,45)
   /     \         /     \
(222,5) (83,20) (115,21) (77,24)

Listed bytes | Frequencies (Weight)
-----------------------------------
  Fictive 4  |   350
      0      |   338
     255     |   300

As you see, since we keep the list sorted, the fictive node 4 has become the
first entry of the list. We join the bytes 0 and 255 in a new fictive node,
the number 5, whose weight is 338+300=638.

(Fictive 5,638)
      /\
     /  \
(255,300) (0,338)

Listed bytes | Frequencies (Weight)
-----------------------------------
  Fictive 5  |   638
  Fictive 4  |   350

The fictive nodes 4 and 5 are finally joined. Final weight: 638+350=988 bytes.
It is actually the total number of bytes in the initial file:
338+300+280+24+21+20+5=988.

                        (Tree,988)
                      1 /        \ 0
                     <-/          \->
                      /            \
            (Fictive 4,350)   (Fictive 5,638)
              /         \         /      \
             /           \       /        \
     (Fictive 3,70)   (31,280) (255,300) (0,338)
        /        \
       /          \
(Fictive 1,25)  (Fictive 2,45)
   /     \         /     \
(222,5) (83,20) (115,21) (77,24)

Bytes | Huffman codes | Frequencies | Binary length*Frequency
--------------------------------------------------------------
   0  |      00b      |     338     |   676
 255  |      01b      |     300     |   600
  31  |      10b      |     280     |   560
  77  |     1100b     |      24     |    96
 115  |     1101b     |      21     |    84
  83  |     1110b     |      20     |    80
 222  |     1111b     |       5     |    20

Results: Original file size: (338+300+280+24+21+20+5)*8=7904 bits (=988 bytes)
versus 676+600+560+96+84+80+20=2116 bits, i.e. 265 bytes (2116/8 rounded up).

Now you know how to code an input stream. The last problem is to decode all
this stuff. Actually, when you meet a binary sequence, you can't say which
byte list it comes from. Furthermore, if you change the occurrence counts of
one or two bytes, you won't obtain the same resulting binary tree. Try for
example to encode the previous list but with the following occurrences:

Listed bytes | Frequencies (Weight)
-----------------------------------
     255     |   418
      0      |   300
      31     |   100
      77     |    24
     115     |    21
      83     |    20
     222     |     5

As you can observe, the resulting binary tree is quite different, even though
we had the same initial bytes. To avoid being in such a situation, we put a
header in front of all the data. I can't comment on this header at length, but
I can say I minimized it as much as I could. The header is divided into two
parts. The first part of this header looks closely like a boolean table (coded
more or less in binary to save space) and the second part provides to the
decoder the binary code associated with each byte encountered in the original
input stream.

Here is a summary of the header:

First part
----------
The first bit tells how the set of bytes present in the source is described:
- First bit = 1: 256 bits follow, each set to 0 or 1 depending on whether the
  corresponding byte was present in the file to compress. This form is used
  when the number n of distinct bytes is large (n bits are set to 1, n>32).
- First bit = 0: 5 bits follow, giving the number n (minus 1) of distinct
  bytes encountered in the file to compress, followed by the n byte values,
  8 bits each (n<=32).

Second part
-----------
Repeated (n+1) times: once for each of the n byte values encountered in the
source file, plus once for the code of a fictive byte used to stop the
decoding (this fictive byte is numbered 256 in the Huffman tree used for
encoding). For each of these (n+1) codes:
- First bit = 1: 8 bits give the length (-1) of the following binary code
  (used when the length is greater than 32), then the binary code itself.
- First bit = 0: 5 bits give the length (-1) of the following binary code
  (length less than or equal to 32), then the binary code itself.

Then comes the binary encoding of the source file, terminated by the code of
the fictive end-of-encoding byte.

With my codecs I can handle binary codes with a length of up to 256 bits.
This corresponds to encoding input streams from one byte up to a practically
infinite length. In fact, if a byte had a range from 0 to 257 instead of 0 to
255, I would only get a bug in my codecs with an input stream of at least
370,959,230,771,131,880,927,453,318,055,001,997,489,772,178,180,790,105
bytes !!!

Where does this explosive number come from? In fact, to have a severe bug, I
must have a completely unbalanced tree:

 Tree
  /\
    \
    /\
      \
      /\
        \
       ...
        /\
          \
          /\

Let's take the following example:

Listed bytes | Frequencies (Weight)
-----------------------------------
      32     |     5
     101     |     3
      97     |     2
     100     |     1
     115     |     1

This produces the following unbalanced tree:

   Tree
    /\
(32,5) \
       /\
 (101,3) \
         /\
   (97,2) \
          /\
    (115,1) (100,1)

Let's speak about a mathematical series: the Fibonacci series. It is defined
as follows:

{ Fib(0)=0
{ Fib(1)=1
{ Fib(n)=Fib(n-2)+Fib(n-1)

Fib(0)=0, Fib(1)=1, Fib(2)=1, Fib(3)=2, Fib(4)=3, Fib(5)=5, Fib(6)=8,
Fib(7)=13, etc.

But 1, 1, 2, 3, 5 are exactly the occurrences of our list! We can actually
demonstrate that to obtain an unbalanced tree, we have to take a list whose
occurrences are based on the Fibonacci series (these values are minimal).
If the data to compress contain m different bytes, then when the tree is
unbalanced the longest code needs m-1 bits. In our little previous example
where m=5, the longest codes are associated with the bytes 100 and 115,
respectively coded 0001b and 0000b. We can also say that to have an unbalanced
tree we must have at least 5+3+2+1+1=12=Fib(7)-1 bytes of input. To conclude,
with a coder that uses at most m-1 bits per code, you must never have an input
stream size over Fib(m+2)-1, otherwise there could be a bug in the output
stream. Of course, with my codecs there will never be a bug because I can deal
with binary code sizes of 1 to 256 bits. Some encoders could use this bound:
with m=31, Fib(31+2)-1=3,524,577 and with m=32, Fib(32+2)-1=5,702,886. An
encoder that uses unsigned integers of 32 bits shouldn't have a bug until
about 4 Gb...

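As a quick check of these two figures, here is a tiny sketch (not part of the
codecs) that computes Fib(m+2)-1 for a given number m of distinct bytes:

#include <stdio.h>

/* Largest input size (in bytes) that cannot force a code longer than
   m-1 bits when the data contain m distinct bytes: Fib(m+2)-1. */
unsigned long long safe_stream_size(int m)
{ unsigned long long a=0,b=1,t;   /* a=Fib(0), b=Fib(1) */
  int i;
  for (i=2;i<=m+2;i++)
      { t=a+b; a=b; b=t; }        /* b=Fib(i) after each step */
  return b-1;                     /* Fib(m+2)-1 */
}

int main(void)
{ printf("m=31 -> %llu\n",safe_stream_size(31)); /* 3524577 */
  printf("m=32 -> %llu\n",safe_stream_size(32)); /* 5702886 */
  return 0;
}
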
+==========================================================+
|                     The LZW encoding                     |
+==========================================================+

The LZW scheme is due to three researchers: Abraham Lempel and Jacob Ziv
worked on it in 1977, and Terry Welch completed the scheme in 1984.

LZW is patented in the USA. This patent, number 4,558,302, is held by Unisys
Corporation. You can usually write (without fees) software codecs which use
the LZW scheme, but hardware companies can't do so. You may get a limited
licence by writing to:
Welch Licencing Department
Office of the General Counsel
M/S C1SW19
Unisys corporation
Blue Bell
Pennsylvania, 19424 (USA)

If you come from the western world, you are surely using an LZW-like encoding
every time you speak, especially when you use a dictionary. Let's consider,
for example, the word "Cirrus". As you read a dictionary, you begin with "A",
"Aa", and so on. But a computer has no experience and it must suppose that
some words already exist. That is why with "Cirrus", it supposes that "C",
"Ci", "Cir", "Cirr", "Cirru", and "Cirrus" exist. Of course, since this is a
computer, all these words are encoded as index numbers. Every time you go
forward, you add a new number associated with the new word. Since a computer
is byte-based and not alphabet-based, you have an initial dictionary of 256
letters instead of our 26 ('A' to 'Z') letters.

Example: Let's code "XYXYZ". First step, "X" is recognized in the initial
dictionary of 256 letters as the 89th. Second step, "Y" is read. Does "XY"
exist? No, then "XY" is stored as the word 256. You write to the output stream
the ASCII code of "X", i.e. 88. Now "YX" is tested as not referenced in the
current dictionary. It is stored as the word 257. You now write 89 (ASCII of
"Y") to the output stream. "XY" is met again. But "XY" is now known, as the
reference 256. Since "XY" exists, you test the sequence with one more letter,
i.e. "XYZ". This last word is not referenced in the current dictionary. You
then write the value 256 and add "XYZ" as the reference 258. Finally, you
reach the last letter ("Z"); since it is the last letter, you just write the
value 90 (ASCII of "Z").

Another encoding sample with the string "ABADABCCCABCEABCECCA".

+----+-----+------------------+------+----------+-------------------------+------+
|Step|Input|Dictionary test   |Prefix|New symbol|Dictionary               |Output|
|    |     |                  |      |          |D0=ASCII with 256 letters|      |
+----+-----+------------------+------+----------+-------------------------+------+
| 1  | "A" |"A" in D0         | "A"  |   "B"    | D1=D0                   |  65  |
|    | "B" |"AB" not in D0    |      |          | and "AB"=256            |      |
+----+-----+------------------+------+----------+-------------------------+------+
| 2  | "A" |"B" in D1         | "B"  |   "A"    | D2=D1                   |  66  |
|    |     |"BA" not in D1    |      |          | and "BA"=257            |      |
+----+-----+------------------+------+----------+-------------------------+------+
| 3  | "D" |"A" in D2         | "A"  |   "D"    | D3=D2                   |  65  |
|    |     |"AD" not in D2    |      |          | and "AD"=258            |      |
+----+-----+------------------+------+----------+-------------------------+------+
| 4  | "A" |"D" in D3         | "D"  |   "A"    | D4=D3                   |  68  |
|    |     |"DA" not in D3    |      |          | and "DA"=259            |      |
+----+-----+------------------+------+----------+-------------------------+------+
| 5  | "B" |"A" in D4         | "AB" |   "C"    | D5=D4                   | 256  |
|    | "C" |"AB" in D4        |      |          | and "ABC"=260           |      |
|    |     |"ABC" not in D4   |      |          |                         |      |
+----+-----+------------------+------+----------+-------------------------+------+
| 6  | "C" |"C" in D5         | "C"  |   "C"    | D6=D5                   |  67  |
|    |     |"CC" not in D5    |      |          | and "CC"=261            |      |
+----+-----+------------------+------+----------+-------------------------+------+
| 7  | "C" |"C" in D6         | "CC" |   "A"    | D7=D6                   | 261  |
|    | "A" |"CC" in D6        |      |          | and "CCA"=262           |      |
|    |     |"CCA" not in D6   |      |          |                         |      |
+----+-----+------------------+------+----------+-------------------------+------+
| 8  | "B" |"A" in D7         | "ABC"|   "E"    | D8=D7                   | 260  |
|    | "C" |"AB" in D7        |      |          | and "ABCE"=263          |      |
|    | "E" |"ABC" in D7       |      |          |                         |      |
|    |     |"ABCE" not in D7  |      |          |                         |      |
+----+-----+------------------+------+----------+-------------------------+------+
| 9  | "A" |"E" in D8         | "E"  |   "A"    | D9=D8                   |  69  |
|    |     |"EA" not in D8    |      |          | and "EA"=264            |      |
+----+-----+------------------+------+----------+-------------------------+------+
| 10 | "B" |"A" in D9         |"ABCE"|   "C"    | D10=D9                  | 263  |
|    | "C" |"AB" in D9        |      |          | and "ABCEC"=265         |      |
|    | "E" |"ABC" in D9       |      |          |                         |      |
|    | "C" |"ABCE" in D9      |      |          |                         |      |
|    |     |"ABCEC" not in D9 |      |          |                         |      |
+----+-----+------------------+------+----------+-------------------------+------+
| 11 | "C" |"C" in D10        | "CCA"|          |                         | 262  |
|    | "A" |"CC" in D10       |      |          |                         |      |
|    |     |"CCA" in D10      |      |          |                         |      |
|    |     |(end of data)     |      |          |                         |      |
+----+-----+------------------+------+----------+-------------------------+------+
You will notice a problem with the above output: how do you write a code of
256 (for example) on 8 bits? It's simple to solve this problem. You just say
that the encoding starts with 9 bits and, as soon as you reach the 512th word,
you use a 10-bit encoding. With 1024 words, you use 11 bits; with 2048 words,
12 bits; and so on with all the powers 2^n (n positive). To better synchronize
the coder and the decoder on all that, most implementations use two additional
references. The word 256 is a reinitialization code (the codec must completely
reinitialize the current dictionary to its 256 initial letters) and the word
257 is an end-of-information code (no more data to read). Of course, you then
start your first new word as code number 258.

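To illustrate the growing code width, here is a minimal sketch of a bit writer
built on the write_byte(x) macro from the beginning of this file. It packs
each code most significant bit first and widens the codes as the dictionary
grows; the actual bit order and the exact moment the width changes are
implementation choices (GIF, for instance, packs least significant bit first),
so treat this only as an illustration:

static unsigned long bit_buffer=0;  /* Bits waiting to be written          */
static int           bit_count=0;   /* Number of valid bits in bit_buffer  */
static int           code_width=9;  /* Current code width, starts at 9 bits */

/* Appends one code of 'code_width' bits to the output, MSB first */
void write_code(unsigned int code)
{ bit_buffer=(bit_buffer<<code_width)|code;
  bit_count+=code_width;
  while (bit_count>=8)
        { write_byte((unsigned char)(bit_buffer>>(bit_count-8)));
          bit_count-=8;
        }
}

/* To be called each time a word is added to the dictionary:
   512 words -> 10 bits, 1024 words -> 11 bits, and so on */
void update_code_width(int dictionary_size)
{ if (dictionary_size==(1<<code_width))
     code_width++;
}

/* At the end of the stream, flush the last partial byte (padded with zeros) */
void flush_bits(void)
{ if (bit_count>0)
     write_byte((unsigned char)(bit_buffer<<(8-bit_count)));
  bit_count=0;
}
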
You can also do as in the GIF file format and start with an initial dictionary
of 18 words to code an input stream whose letters are coded on only 4 bits
(you then start with codes of 5 bits in the output stream!). The 18 initial
words are: 0 to 15 (initial letters), 16 (reinitialize the dictionary), and 17
(end of information). The first new word has code 18, the second new word
code 19, ...

Important: You can consider that your dictionary is limited to 4096 different
words (as in the GIF and TIFF file formats). But if your dictionary is full,
you can decide to send old codes *without* reinitializing the dictionary. All
decoders must be compliant with this. This lets you decide that it is not
efficient to reinitialize the full dictionary: instead of doing so, you keep
the dictionary unchanged and you send/receive (depending on whether it's a
coder or a decoder) existing codes from the full dictionary.

My codecs are able to deal both with most initial data sizes in the initial
dictionary and with a full dictionary.

Let's see how to decode an LZW encoding. We saw that the true dynamic Huffman
scheme needed a header in front of the encoded codes. No header is needed in
the LZW scheme. When two successive codes are read, the first must exist in
the dictionary. That code can be immediately decoded and its word written to
the output stream. If the second code is less than or equal to the number of
words in the current dictionary, this code is decoded in the same way as the
first one. On the other hand, if the second code is equal to the number of
words in the dictionary plus one, this means you have to write the word (the
sentence, not the code number) of the last code, followed by the first
character of that same word. In both cases you then make a new word appear:
it is composed of all the letters of the word associated with the first code
plus the first letter of the word of the second code. You continue the
processing with the second and third codes read in the input stream (of
codes)...

Example: Let's decode the previous encoding given a bit further above.

+------+-------+----------------+----------+------------------+--------+
| Step | Input | Code to decode | New code | Dictionary       | Output |
+------+-------+----------------+----------+------------------+--------+
|  1   |  65   |       65       |    66    | 65,66=256        |  "A"   |
|      |  66   |                |          |                  |        |
+------+-------+----------------+----------+------------------+--------+
|  2   |  65   |       66       |    65    | 66,65=257        |  "B"   |
+------+-------+----------------+----------+------------------+--------+
|  3   |  68   |       65       |    68    | 65,68=258        |  "A"   |
+------+-------+----------------+----------+------------------+--------+
|  4   |  256  |       68       |   256    | 68,65=259        |  "D"   |
+------+-------+----------------+----------+------------------+--------+
|  5   |  67   |      256       |    67    | 65,66,67=260     |  "AB"  |
+------+-------+----------------+----------+------------------+--------+
|  6   |  261  |       67       |   261    | 67,67=261        |  "C"   |
+------+-------+----------------+----------+------------------+--------+
|  7   |  260  |      261       |   260    | 67,67,65=262     |  "CC"  |
+------+-------+----------------+----------+------------------+--------+
|  8   |  69   |      260       |    69    | 65,66,67,69=263  | "ABC"  |
+------+-------+----------------+----------+------------------+--------+
|  9   |  263  |       69       |   263    | 69,65=264        |  "E"   |
+------+-------+----------------+----------+------------------+--------+
|  10  |  262  |      263       |   262    |65,66,67,69,67=265| "ABCE" |
+------+-------+----------------+----------+------------------+--------+
|  11  |       |      262       |          |                  | "CCA"  |
+------+-------+----------------+----------+------------------+--------+

Summary: Step 4 is an explicit example. The code to decode is 68 ("D" in
ASCII) and the new code is 256. The new word to add to the dictionary is made
of the letters of the word of the first code plus the first letter of the word
of the second code (code 256), i.e. 68 ("D" in ASCII) plus 65 ("A"). So the
new word has the letters 68 and 65 ("DA").

Step 6 is quite special. The first code to decode is referenced, but the
second (new) code is not yet referenced, since the dictionary is limited to
260 referenced words at that point. We have to handle it as the second case
given previously: you must take the word to decode plus its first letter, i.e.
"C"+"C"="CC".
Be careful: if any encountered code is *greater* than the dictionary size plus
one, it means you have a problem in your data and/or your codecs are... bad!

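Following the rules just described, here is a sketch of a matching decoder. It
uses the same flat (prefix, suffix) table as the encoder sketch given earlier
and prints the decoded bytes with putchar() rather than write_byte(). Feeding
it the eleven codes of the table above rebuilds "ABADABCCCABCEABCECCA":

#include <stdio.h>

#define MAX_WORDS 4096

static int dict_prefix[MAX_WORDS];  /* Code of the word without its last byte */
static int dict_suffix[MAX_WORDS];  /* Last byte of the word */
static int dict_size;

/* Writes the word of 'code' to stdout and returns its first byte */
static int output_word(int code)
{ int first;
  if (code<256)
     { putchar(code); return code; }
  first=output_word(dict_prefix[code]);
  putchar(dict_suffix[code]);
  return first;
}

/* Decodes an array of 'n' codes produced by the encoder sketch above */
void lzw_decode(const int *codes,int n)
{ int i,previous,current,first;

  dict_size=256;                       /* Codes 0..255 = single bytes */
  if (n==0) return;
  previous=codes[0];
  output_word(previous);
  for (i=1;i<n;i++)
      { current=codes[i];
        if (current<dict_size)                 /* Known code: decode it        */
           first=output_word(current);
        else                                   /* Code not yet in dictionary:  */
           { first=output_word(previous);      /* word(previous) + its first   */
             putchar(first);                   /* letter                       */
           }
        if (dict_size<MAX_WORDS)               /* New word = word(previous) +  */
           { dict_prefix[dict_size]=previous;  /* first letter of the word     */
             dict_suffix[dict_size]=first;     /* just written                 */
             dict_size++;
           }
        previous=current;
      }
}

int main(void)
{ static const int codes[]={65,66,65,68,256,67,261,260,69,263,262};
  lzw_decode(codes,11);
  putchar('\n');
  return 0;
}
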
Tricks to improve LZW encoding (but it then becomes a non-standard encoding):
- Limit the dictionary to a high number of words (a 4096-word maximum enables
you to encode a stream of a maximum of 7,370,880 letters with the same
dictionary)
- Use a dictionary of fewer than 258 initial words if possible (for example,
with 16-color pictures, you start with a dictionary of 18 words)
- Do not reinitialize the dictionary when it is full
- Reinitialize the dictionary with the most frequent words of the previous
dictionary
- Use the codes from (current dictionary size+1) to (maximum dictionary size),
because these codes are not used in the standard LZW scheme.
Such a compression scheme has been used (successfully) by Robin Watts
<ct93008@ox.ac.uk>.

+==========================================================+
|                          Summary                         |
+==========================================================+

-------------------------------------------------
RLE type 1:
Fastest compression. Good ratio for general purposes.
Doesn't need to read the data twice.
Fast decoding.
-------------------------------------------------
RLE type 2:
Fast compression. Very good ratio in general (even for general purposes).
Needs to read the data twice.
Fast decoding.
-------------------------------------------------
RLE type 3:
Slowest compression. Good ratio on image files, quite middling for general
purposes.
Needs to read the data twice.
Change the line:
#define MAX_RASTER_SIZE 256
into:
#define MAX_RASTER_SIZE 16
to speed up the encoding (but the resulting ratio decreases). If you compress
with memory buffers, do not modify this line...
Fast decoding.
-------------------------------------------------
RLE type 4:
Slow compression. Good ratio on image files, middling for general purposes.
Change the line:
#define MAX_RASTER_SIZE 66
into:
#define MAX_RASTER_SIZE 16
to speed up the encoding (but the resulting ratio decreases). If you compress
with memory buffers, do not modify this line...
Fast decoding.
-------------------------------------------------
Huffman:
Fast compression. Good ratio on text files and similar, middling for general
purposes. An interesting method to use to compress a buffer already compressed
by the RLE type 1 or 2 methods...
Fast decoding.
-------------------------------------------------
LZW:
Quite fast compression. Good, or even very good, ratio for general purposes.
The bigger the data, the better the compression ratio.
Quite fast decoding.
-------------------------------------------------

The source codes work on all kinds of computers with a C compiler.
With the compiler, optimize for run speed rather than for space.
On a UNIX system, it's better to compile them with the -O option.
If you don't use a GNU compiler, the source file MUST NOT have a size
over 4 Gb for RLE 2, 3, and Huffman, because I count the number of
occurrences of the bytes. With GNU compilers, 'unsigned long int' is 8 bytes
instead of 4 bytes (as with normal C UNIX compilers and PC compilers, such as
Microsoft C++ and Borland C++).
Actually:
* Normal UNIX compilers,                => 4 Gb (unsigned long int = 4 bytes)
  Microsoft C++ and Borland C++ for PCs
* GNU UNIX compilers                    => 17179869184 Gb (unsigned long int = 8 bytes)

+==========================================================+
|                            END                           |
+==========================================================+