+===========================================================+
|     Introduction to the lossless compression schemes      |
|          Description of the codec source codes            |
+-----------------------------------------------------------+
| From David Bourgin (E-mail: david.bourgin@ufrima.imag.fr) |
| Date: 22/9/94                                             |
+===========================================================+

------ BE CAREFUL ------
This file (compress.txt) is copyrighted. (c) David Bourgin - 1994
Permission to use this documentation for any purpose other than its incorporation into a commercial product is hereby granted without fee. Permission to copy and distribute this documentation only for non-commercial use is also granted without fee, provided, however, that the above copyright notice appears in all copies, and that both that copyright notice and this permission notice appear in supporting documentation. The author makes no representations about the suitability of this documentation for any purpose. It is provided "as is" without express or implied warranty.

The source codes you obtain with this file are *NOT* covered by the same copyright, because you may include them for both commercial and non-commercial use. See below for more information.

The source code files (codrle1.c, dcodrle1.c, codrle2.c, dcodrle2.c, codrle3.c, dcodrle3.c, codrle4.c, dcodrle4.c, codhuff.c, dcodhuff.c) are copyrighted. They were uploaded by ftp to turing.imag.fr (129.88.31.7):/pub/compression on 22/5/94 and were modified on 22/9/94. (c) David Bourgin - 1994

The source codes I provide have no bugs (!), but since I make them available for free I have some notes to make. They can change at any time without notice. I assume no responsibility or liability for any errors or inaccuracies, make no warranty of any kind (express, implied or statutory) with respect to this publication, and expressly disclaim any and all warranties of merchantability and fitness for particular purposes. Of course, if you have problems using the information presented here, I will try to help you if I can.

If you include the source codes in your application, here are the conditions:
- You have to put my name in the header of your source file (not in the executable program if you don't want to) (this item is a must)
- I would like to see your resulting application, if possible (this item is not a must, because some applications must remain secret)
- Whenever you gain money with your application, I would like to receive a very small part of it, in order to be encouraged to update my source codes and to develop new schemes (this item is not a must)
---------------------

There are several ways to compress data. Here, we are only going to deal with the lossless schemes. These schemes are also called non-destructive because you always recover the initial data you had, and this as soon as you need them. With lossless schemes you will never lose any information (except perhaps when you store or transmit your data, but that is another problem...).

In this introduction, we are going to see:
- The RLE scheme (with different possible algorithms)
- The Huffman scheme (dynamic version)
- And the LZW scheme

For the novice, a compressor is a program able to read several data items (e.g. bytes) on input and to write several data items on output. The data you obtain on output (also called compressed data) will - of course - take less space than the input data. This is true in most cases, if the compressor works and if the type of the data is suitable for compression with the given scheme.
The codec (coder-decoder) enables you to save space on your hard disk and/or to reduce communication costs, because you always store/transmit the compressed data. You'll use the decompressor as soon as you need to recover your initial, useful data. Note that the compressed data are useless if you don't have the decoder...

You are doubtless asking: "How can I reduce the data size without losing information?". It's easy to answer this question with an example. I'm sure you have heard about Morse code. This system, established in the 19th century, uses a scheme very close to the Huffman one. In Morse you encode the letters to transmit with two kinds of signs. If you encode these two possible signs in one bit, the symbol 'e' is transmitted in a single bit while the symbols 'y' and 'z' need four bits. Look at the frequencies of the symbols in the text you are reading and you'll quickly understand the compression gain...

Important: The source codes associated with the algorithms I present are fully adaptable to whatever you need to compress. They all use basic macros at the top of the file. Usually the macros to change are:
- beginning_of_data
- end_of_data
- read_byte
- read_block
- write_byte
- write_block
These allow the programmer to modify only a small part of the header of the source codes in order to compress memory as well as files.

beginning_of_data(): Macro that sets the program up so that the next read_byte() call will read the first byte to compress.
end_of_data(): Returns a boolean telling whether any bytes remain to be read from the input stream. It returns 0 if there are still bytes to compress, a non-zero value otherwise.
read_byte(): Returns a byte read from the input stream, if available.
write_byte(x): Writes the byte 'x' to the output stream.
read_block(...) and write_block(...): Same use as read_byte() and write_byte(x), but these macros work on blocks of bytes instead of a single byte.

If you want to compress *from* memory, before entering an xxxcoding procedure ('xxx' is to be replaced by the name of a given codec), you have to add a pointer set to the beginning of the zone to compress. Note that the following pointer 'source_memory_base' is not to be added; it is just given here to name the address of the memory zone you are going to encode or decode. The same holds for source_memory_end, which can be either a pointer to create or an existing pointer.

unsigned char *source_memory_base, /* Base of the source memory */
              *source_memory_end,  /* Last address to read:
                                      source_memory_end=source_memory_base+source_zone_length-1 */
              *source_ptr;         /* Used in the xxxcoding procedure */

void pre_start()
{ source_ptr=source_memory_base;
  xxxcoding();
}

end_of_data() and read_byte() are also to be modified to compress *from* memory:

#define end_of_data() (source_ptr>source_memory_end)
#define read_byte()   (*(source_ptr++))

If you want to compress *to* memory, before entering an xxxcoding procedure, you have to add another pointer. Note that the pointer 'dest_memory_base' is not to be added either; it is just given here to name the address of the destination memory zone you are going to encode or decode to.

unsigned char *dest_memory_base, /* Base of the destination memory */
              *dest_ptr;         /* Used in the xxxcoding procedure */

void pre_start()
{ dest_ptr=dest_memory_base;
  xxxcoding();
}
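The codecs as distributed drive these same macros with the files source_file and dest_file handled in main(). As a point of comparison, here is a minimal sketch of what file-based versions of the macros can look like (this is my own illustration for this text, *not* an excerpt of the codec sources; the variables source_length and bytes_read are assumptions of the sketch):

#include <stdio.h>

FILE *source_file,            /* Opened by the caller in "rb" mode */
     *dest_file;              /* Opened by the caller in "wb" mode */
long source_length,           /* Size of the input file, in bytes */
     bytes_read;              /* Number of bytes consumed so far */

#define beginning_of_data() (fseek(source_file,0L,SEEK_SET),bytes_read=0)
#define end_of_data()       (bytes_read>=source_length)
#define read_byte()         (bytes_read++,(unsigned char)fgetc(source_file))
#define write_byte(x)       fputc((int)(x),dest_file)

void pre_start()
{ fseek(source_file,0L,SEEK_END);  /* Measure the input file... */
  source_length=ftell(source_file);
  beginning_of_data();             /* ...then rewind it */
  xxxcoding();
}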
Of course, you can combine both the from-memory and to-memory set-ups in the pre_start() procedure. The files dest_file and source_file handled in the main() function are then to be removed...

void pre_start()
{ source_ptr=source_memory_base;
  dest_ptr=dest_memory_base;
  xxxcoding();
}

In fact, when writing to memory, the problem lies in the write_byte(x) macro. This problem exists because your destination zone can be either a static zone or a dynamically allocated zone. In both cases, you have to check that there is no overflow, especially if the coder is inefficient and produces more bytes than you reserved in memory.

In the first case, with a *static* zone, the write_byte(x) macro should look like this:

unsigned long int dest_zone_length,current_size;

#define write_byte(x) { if (current_size==dest_zone_length) \
                           exit(1); \
                        dest_ptr[current_size++]=(unsigned char)(x); \
                      }

In the static version, the pre_start() procedure is to be modified as follows:

void pre_start()
{ source_ptr=source_memory_base;
  dest_ptr=dest_memory_base;
  dest_zone_length=...; /* Set to the actual destination zone length */
  current_size=0;       /* Number of written bytes */
  xxxcoding();
}

Otherwise, dest_ptr points to a zone created by malloc(), and you can try to resize the allocated zone with realloc() (both need <stdlib.h>). Note that I grow the zone one kilobyte at a time. You have to add the same two variables:

unsigned long int dest_zone_length,current_size;

#define write_byte(x) { if (current_size==dest_zone_length) \
                           { dest_zone_length += 1024; \
                             if ((dest_ptr=(unsigned char *)realloc(dest_ptr,dest_zone_length*sizeof(unsigned char)))==NULL) \
                                exit(1); /* You can't compress in memory \
                                            => I exit, but *you* can write a routine that swaps to disk */ \
                           } \
                        dest_ptr[current_size++]=(unsigned char)(x); \
                      }

With the dynamically allocated version, change the pre_start() routine as follows:

void pre_start()
{ source_ptr=source_memory_base;
  dest_zone_length=1024;
  if ((dest_ptr=(unsigned char *)malloc(dest_zone_length*sizeof(unsigned char)))==NULL)
     exit(1); /* You need at least 1 kb of dynamic memory! */
  current_size=0; /* Number of written bytes */
  xxxcoding();
  /* Handle the bytes in dest_ptr, but don't forget to release them
     afterwards with: free(dest_ptr); */
}

The previously given macros can be used as in:

void demo()
/* The file opening and closing, and the variables, must be set up
   by the calling procedure */
{ unsigned char byte; /* And not 'char byte' (!) */

  while (!end_of_data())
        { byte=read_byte();
          printf("Byte read=%c\n",byte); /* printf() needs <stdio.h> */
        }
}

You must not change the rest of the programs unless you're really sure and really need to do it!

+==========================================================+
|                     The RLE encoding                     |
+==========================================================+

RLE is an acronym that stands for Run Length Encoding. You may encounter it under another acronym, RLC (Run Length Coding). The idea of this scheme is to re-code your data with regard to repeated frames. A frame is one or more bytes that occur one or several times. There are several ways to encode the occurrences, so you'll have several codecs. For example, you may have a sequence such as:

0,0,0,0,0,0,255,255,255,2,3,4,2,3,4,5,8,11

Some codecs will only deal with the repetitions of '0' and '255', while others will also deal with the repetition of '2,3,4'. You have to keep in mind something important based on this example: a given codec won't work well on all the data you try to compress. So, when there are no repeated sequences, the codecs based on RLE schemes must not just display a message saying "bye bye".
Actually, they will try to encode the non-repeated data with a value that says "Sorry, I'm only making a copy of the initial input". Of course, a copy of the input data with a header in front of it makes the output bigger, but if you consider the whole data to compress, the encoding of the repeated frames saves more space than the encoding of the non-repeated frames costs.

All of the algorithms named RLE have the following shape, with three or four values:
- A value saying whether there is a repetition
- A value saying how many repetitions (or non-repetitions) there are
- A value giving the length of the frame (useless if you only encode frames with a maximum length of one byte)
- The value of the frame to repeat (or not)

I provide four algorithms to illustrate this.

*** First RLE scheme ***

The first scheme is the simplest I know, and looks like the one used in the Macintosh system (PackBits) and in some image file formats such as Targa, PCX, TIFF, ... Here, every compressed block begins with a byte, named the header, whose description is:

Bits:    7 6 5 4 3 2 1 0
Header:  X X X X X X X X

Bit 7:      compression status (1 = compression applied)
Bits 0 to 6: number of bytes to handle

So, if bit 7 is set to 0, bits 0 to 6 give the number of bytes that follow (minus 1, to gain a little more compression) and that were not compressed (native bytes). If bit 7 is set to 1, the same bits 0 to 6 give the number of repetitions (minus 2) of the following byte. As you see, this method only handles frames of one byte.

Additional note: you have 'minus 1' for non-repeated frames because you must have at least one byte to compress, and 'minus 2' for repeated frames because the repetition count must be at least 2.

Compression scheme:

              First byte = next byte?
                        /\
                   yes /  \ no
                      /    \
   Count the occurrences    Count the NON identical
   of identical bytes       bytes (maximum 128) and
   (maximum 129)            store them in an array
           |                          |
      1 bit '1'                  1 bit '0'
      + 7 bits giving the        + 7 bits giving the
        number (-2) of             number (-1) of
        repetitions                non-repeated bytes
      + the repeated byte        + the n non-repeated bytes
           |                          |
    1xxxxxxx,yyyyyyyy            0xxxxxxx,n bytes
    [-----------------]          [----------------]

Example:

Sequence of bytes to encode | Coded values | Difference with compression
                            |              | (output-input, unit: byte)
-------------------------------------------------------------------------
255,15,                     | 1,255,15,    | +1
255,255,                    | 128,255,     |  0
15,15,                      | 128,15,      |  0
255,255,255,                | 129,255,     | -1
15,15,15,                   | 129,15,      | -1
255,255,255,255,            | 130,255,     | -2
15,15,15,15                 | 130,15       | -2

See codec source codes: codrle1.c and dcodrle1.c
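To make the header format concrete, here is a minimal sketch of an RLE type 1 coder written for this text (it is *not* a copy of codrle1.c, and the function and variable names are mine). It encodes one memory buffer into another, emitting the header bytes described above:

/* Encode src[0..len-1] into dst and return the number of bytes
   written. Header: bit 7 set   => (number-2) repetitions of one byte,
                    bit 7 clear => (number-1) native bytes follow. */
unsigned long rle1_encode(const unsigned char *src,unsigned long len,
                          unsigned char *dst)
{ unsigned long i=0,out=0;

  while (i<len)
        { unsigned long run=1;

          while (i+run<len && run<129 && src[i+run]==src[i])
                run++;
          if (run>=2)                 /* Repeated frame: 2..129 times */
             { dst[out++]=(unsigned char)(128+run-2);
               dst[out++]=src[i];
               i+=run;
             }
          else                        /* Native frame: 1..128 bytes */
             { unsigned long start=i,n=0;

               while (i<len && n<128 &&
                      (i+1>=len || src[i+1]!=src[i]))
                     { i++;
                       n++;
                     }
               dst[out++]=(unsigned char)(n-1);
               while (start<i)
                     dst[out++]=src[start++];
             }
        }
  return out;
}

For instance, on the 18-byte sequence given at the beginning of this section (0,0,0,0,0,0,255,255,255,2,3,4,2,3,4,5,8,11), this sketch emits 132,0 for the six '0's, 129,255 for the three '255's, and then the native frame 8,2,3,4,2,3,4,5,8,11: 14 bytes instead of 18.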
*** Second RLE scheme ***

In the second RLE scheme, you look for the least frequent byte in the source to compress and use it as a header for all compressed blocks. In the best case, the occurrence of this byte is zero in the data to compress. Two schemes are possible: handling frames of only one byte, or handling frames of one byte *and* more. The first case is the subject of this scheme; the second is the subject of the next one.

For frames of one byte, the header byte is written in front of every repetition of at least 4 bytes. It is then followed by the repetition count minus 1 and the repeated byte:

Header byte, occurrence number-1, repeated byte

If a byte doesn't repeat more than three times, the bytes are written unchanged to the destination stream (no header, length, or repetition before or after these bytes).

An exception: if the header byte itself appears in the source one, two, three or more times, it will respectively be encoded as:
- Header byte, 0
- Header byte, 1
- Header byte, 2
- Header byte, occurrence number-1, header byte

Example: let's take the previous example. An infrequent byte is the zero byte (ASCII 0), because it never appears there.

Sequence of bytes to encode | Coded values | Difference with compression
                            |              | (output-input, unit: byte)
-------------------------------------------------------------------------
255,15,                     | 255,15,      |  0
255,255,                    | 255,255,     |  0
15,15,                      | 15,15,       |  0
255,255,255,                | 255,255,255, |  0
15,15,15,                   | 15,15,15,    |  0
255,255,255,255,            | 0,3,255,     | -1
15,15,15,15                 | 0,3,15       | -1

If the header byte appeared in the data, we would see:

Sequence of bytes to encode | Coded values | Difference with compression
                            |              | (output-input, unit: byte)
-------------------------------------------------------------------------
0,                          | 0,0,         | +1
255,                        | 255,         |  0
0,0,                        | 0,1,         |  0
15,                         | 15,          |  0
0,0,0,                      | 0,2,         | -1
255,                        | 255,         |  0
0,0,0,0                     | 0,3,0        | -1

See codec source codes: codrle2.c and dcodrle2.c

*** Third RLE scheme ***

It's the same idea as in the second scheme, but frames of more than one byte can be encoded. So we have three cases:
- The header byte, whatever its occurrence count, is encoded as: header byte, 0, number of occurrences-1
- Frames for which (repetitions-1)*length>3 are encoded as: header byte, number of frame repetitions-1, frame length-1, bytes of the frame
- In all other cases the bytes are written unchanged (no header, length, or repetition before or after these bytes).

Example based on the previous examples:

Sequence of bytes to encode | Coded values       | Difference with compression
                            |                    | (output-input, unit: byte)
-----------------------------------------------------------------------------
255,15,                     | 255,15,            |  0
255,255,                    | 255,255,           |  0
15,15,                      | 15,15,             |  0
255,255,255,                | 255,255,255,       |  0
15,15,15,                   | 15,15,15,          |  0
255,255,255,255,            | 255,255,255,255,   |  0
15,15,15,15,                | 15,15,15,15,       |  0
16,17,18,16,17,18,          | 16,17,18,16,17,18, |  0
255,255,255,255,255,        | 0,4,0,255,         | -1
15,15,15,15,15,             | 0,4,0,15,          | -1
16,17,18,16,17,18,16,17,18, | 0,2,2,16,17,18,    | -3
16,17,18,19,16,17,18,19     | 0,1,3,16,17,18,19  | -1

If the header byte (value 0) were met, we would see:

Sequence of bytes to encode | Coded values | Difference with compression
                            |              | (output-input, unit: byte)
--------------------------------------------------------------------------
0,                          | 0,0,0,       | +2
255,                        | 255,         |  0
0,0,                        | 0,0,1,       | +1
15,                         | 15,          |  0
0,0,0,                      | 0,0,2,       |  0
255,                        | 255,         |  0
0,0,0,0                     | 0,0,3        | -1

See codec source codes: codrle3.c and dcodrle3.c
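Both scheme 2 and scheme 3 need a preliminary pass over the data to pick the header byte, which is why they read the source twice. A minimal sketch of this first pass (my own illustration for this text, not an excerpt of codrle2.c or codrle3.c) could be:

/* Return the least frequent byte value of src[0..len-1]. In the
   best case it occurs zero times; it will serve as the header byte. */
unsigned char least_frequent_byte(const unsigned char *src,
                                  unsigned long len)
{ unsigned long count[256]={0};
  unsigned long i;
  int best=0;

  for (i=0;i<len;i++)           /* First pass: count the occurrences */
      count[src[i]]++;
  for (i=1;i<256;i++)           /* Keep the rarest byte value */
      if (count[i]<count[best])
         best=(int)i;
  return (unsigned char)best;
}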
*** Fourth RLE scheme ***

This last RLE algorithm handles repetitions of any kind (one byte and more) as well as non-repetitions, including short non-repetitions, and it does not read the source twice as RLE type 3 does. The compression scheme is:

First byte = next byte?

- Yes: write 1 bit '0', then count the occurrences of the repeated
  byte (maximum 16449 repetitions):
    - Fewer than 66 repetitions: 1 bit '0' + 6 bits giving the number
      (-2) of repetitions + the repeated byte
            00xxxxxx,1 byte
    - 66 repetitions or more: 1 bit '1' + 14 bits giving the number
      (-66) of repetitions + the repeated byte
            01xxxxxx,xxxxxxxx,1 byte

- No: write 1 bit '1'. Is a motif (a frame of 2 to 65 bytes) repeated?
    - Yes: 1 bit '0' + 6 bits giving the length (-2) of the motif
      + 8 bits giving the number (-2) of repetitions (maximum 257)
      + the bytes of the motif
            10xxxxxx,yyyyyyyy,n bytes
    - No: write 1 bit '1', then count the non-repeated bytes
      (maximum 8224):
        - Fewer than 33: 1 bit '0' + 5 bits giving the number (-1)
          of non-repeated bytes + the non-repeated bytes
                110xxxxx,n bytes
        - 33 or more: 1 bit '1' + 13 bits giving the number (-33)
          of non-repeated bytes + the non-repeated bytes
                111xxxxx,xxxxxxxx,n bytes

Example, same as previously:

Sequence of bytes to encode | Coded values            | Difference with compression
                            |                         | (output-input, unit: byte)
-----------------------------------------------------------------------------------
255,15                      | 11000001b,255,15,       | +1
255,255                     | 00000000b,255,          |  0
15,15                       | 00000000b,15,           |  0
255,255,255                 | 00000001b,255,          | -1
15,15,15                    | 00000001b,15,           | -1
255,255,255,255             | 00000010b,255,          | -2
15,15,15,15                 | 00000010b,15,           | -2
16,17,18,16,17,18           | 10000001b,0,16,17,18,   | -1
255,255,255,255,255         | 00000011b,255,          | -3
15,15,15,15,15              | 00000011b,15,           | -3
16,17,18,16,17,18,16,17,18  | 10000001b,1,16,17,18,   | -4
16,17,18,19,16,17,18,19     | 10000010b,0,16,17,18,19 | -2

See codec source codes: codrle4.c and dcodrle4.c

+==========================================================+
|                   The Huffman encoding                   |
+==========================================================+

This method is named after the researcher who established the algorithm in 1952 (David Huffman). It allows both dynamic and static statistical schemes. A statistical scheme works on the occurrences of the data: it is not as with RLE, where you consider the current occurrence of a frame, but rather a consideration of the global occurrences of each datum in the input stream. Here too, frames can be any kind of sequence you want.

Static Huffman encoding appears in some compressors such as ARJ on PCs. It forces the encoder to consider the statistics as the same whatever the data are. Of course, the results are not as good as with dynamic encoding. Static encoding is faster than dynamic encoding, but dynamic encoding adapts to the actual statistics of the bytes of the input stream and thus becomes more efficient, producing a shorter output.

The main idea in Huffman encoding is to re-code every byte with regard to its occurrence count. The most frequent bytes in the data to compress will be encoded with fewer than 8 bits, and the others may need 8 bits or even more. You immediately see that the codes associated with the different bytes won't all have the same size. The Huffman method actually requires that the binary codes do not have a fixed size; we then speak of variable-length codes.

The dynamic Huffman scheme needs binary trees for the encoding. This enables you to obtain the best codes, adapted to the source data. The demonstration won't be given here. To help the neophyte, I will just explain what a binary tree is. A binary tree is a special fashion of representing data: a structure with an associated value and two pointers.
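In C, such a structure can be sketched as follows (an illustration for this text; the field names are mine, and the fictive byte value 256 anticipates the convention introduced below):

/* One node of a (Huffman) binary tree */
typedef struct node
        { unsigned long weight;  /* Occurrence count of the byte or sub-tree */
          int byte;              /* 0..255 real byte, 256 = fictive node */
          struct node *left,*right;
        } node;

A node whose two pointers are null is a "leaf"; as explained below, the encoded bytes will sit in the leaves.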
The term "binary" comes from the presence of the two pointers. By convention, one of the pointers is called the left pointer and the other the right pointer. Here is a visual representation of a binary tree:

          Value
          /   \
         /     \
     Value     Value
     /   \     /   \
   ...   ... ...   ...

One problem with a binary encoding is the prefix problem. A prefix is the first part of the representation of a value, e.g. "h" and "he" are prefixes of "hello", but "el" is not. To understand the problem, let's code the letters "A", "B", "C", "D", and "E" respectively as 00b, 01b, 10b, 11b, and 100b. When you read the binary sequence 00100100b, you are unable to say whether it comes from "ACBA" or "AEE". To avoid such situations, the codes must have the prefix property: no code may be the beginning of another code. With "A", "B", "C", "D", and "E" respectively encoded as 1b, 01b, 001b, 0001b, and 0000b, the sequence 1001011b can only be decoded as "ACBA".

      1 /\ 0
       /  \
     "A"  /\
         /  \
       "B"  /\
           /  \
         "C"  /\
             /  \
           "D"  "E"

As you see with this tree, an encoding has the prefix property if the bytes are at the ends of the "branches" (the leaves) and no byte sits at a "node". You also see that when you reach a character through a right pointer you add a bit set to 0, and through a left pointer you add a bit set to 1 to the current code. The previous *bad* encoding gives the following bad tree:

           /\
          /  \
         /    \
       /\      /\
      /  \    /  \
    "D"  "C" "B"  "A"
           \
           "E"

You see here that the coder shouldn't put "C" at a node: "C" (10b) is a prefix of "E" (100b)...

As you see, the longest binary codes are those with the longest distance from the top of the tree. Finally, the more frequent bytes will be placed highest in the tree so that they get the shortest codes, and the less frequent bytes will be placed lowest.

From an algorithmic point of view, you make a list of each byte encountered in the stream to compress. This list is always kept sorted. The zero-occurrence bytes are removed from the list. You take the two bytes with the smallest occurrences in the list; whenever two bytes have the same "weight", you may take either of them regardless of their ASCII value. You join them under a node. This node is given a fictive byte value (256 is a good one!), and its weight is the sum of the weights of the two joined bytes. You then replace the two joined bytes with the fictive byte in the list, and you continue so until only one byte (fictive or not) remains in the list. Of course, this process produces the shortest codes only if the list remains sorted. I will not explain with arcane hard maths why the result is a set of the shortest codes...

Important: I use the convention that a right sub-tree always has a weight greater than or equal to the weight of the left sub-tree.
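As an illustration of this joining loop, here is a sketch written for this text (it is *not* an excerpt of codhuff.c; all names are mine, and the node structure is repeated so that the fragment stands alone):

#include <stdlib.h>

typedef struct node
        { unsigned long weight;
          int byte;              /* 0..255 real byte, 256 = fictive node */
          struct node *left,*right;
        } node;

/* list[0..n-1] holds one node per encountered byte, sorted by
   decreasing weight. Returns the single remaining node: the root. */
node *build_tree(node *list[],int n)
{ while (n>1)
        { node *fictive=(node *)malloc(sizeof(node));
          int i;

          if (fictive==NULL)
             exit(1);                   /* Same policy as the earlier macros */
          fictive->byte=256;
          fictive->left=list[n-1];      /* Lightest node */
          fictive->right=list[n-2];     /* Heavier node: the right sub-tree
                                           weighs more, as the convention says */
          fictive->weight=list[n-1]->weight+list[n-2]->weight;
          n-=2;
          for (i=n;i>0 && list[i-1]->weight<fictive->weight;i--)
              list[i]=list[i-1];        /* Re-insert it, keeping the list sorted */
          list[i]=fictive;
          n++;
        }
  return list[0];
}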
Example: let's take a file to compress in which we notice the following occurrences:

Listed bytes | Frequencies (weight)
----------------------------------
      0      |   338
    255      |   300
     31      |   280
     77      |    24
    115      |    21
     83      |    20
    222      |     5

We begin by joining the bytes 83 and 222. This produces a fictive node 1 with a weight of 20+5=25.

   (Fictive 1,25)
        /\
       /  \
  (222,5) (83,20)

Listed bytes | Frequencies (weight)
----------------------------------
      0      |   338
    255      |   300
     31      |   280
  Fictive 1  |    25
     77      |    24
    115      |    21

Note that the list is still sorted... The smallest frequencies are now 21 and 24. That is why we take the bytes 77 and 115 to build the fictive node 2.

   (Fictive 2,45)
        /\
       /  \
 (115,21) (77,24)

Listed bytes | Frequencies (weight)
----------------------------------
      0      |   338
    255      |   300
     31      |   280
  Fictive 2  |    45
  Fictive 1  |    25

The nodes with the smallest weights are now the fictive nodes 1 and 2. They are joined to build the fictive node 3, whose weight is 45+25=70.

          (Fictive 3,70)
           /         \
          /           \
        /\             /\
       /  \           /  \
  (222,5) (83,20) (115,21) (77,24)

Listed bytes | Frequencies (weight)
----------------------------------
      0      |   338
    255      |   300
     31      |   280
  Fictive 3  |    70

The fictive node 3 is then joined to the byte 31. Total weight: 280+70=350.

       (Fictive 4,350)
         /        \
        /          \
 (Fictive 3,70)  (31,280)

Listed bytes | Frequencies (weight)
----------------------------------
  Fictive 4  |   350
      0      |   338
    255      |   300

As you see, since we keep the list sorted, the fictive node 4 has become the first in the list. We join the bytes 0 and 255 into a new fictive node, number 5, whose weight is 338+300=638.

   (Fictive 5,638)
        /\
       /  \
 (255,300) (0,338)

Listed bytes | Frequencies (weight)
----------------------------------
  Fictive 5  |   638
  Fictive 4  |   350

The fictive nodes 4 and 5 are finally joined. Final weight: 638+350=988, which is actually the total number of bytes in the initial file: 338+300+280+24+21+20+5=988.

                   (Tree,988)
                  1 /      \ 0
                   /        \
                  /          \
        (Fictive 4,350)   (Fictive 5,638)
           /       \          /      \
          /         \        /        \
  (Fictive 3,70) (31,280) (255,300) (0,338)
      /       \
     /         \
    /\          /\
   /  \        /  \
(222,5)(83,20)(115,21)(77,24)

Bytes | Huffman codes | Frequencies | Binary length*frequency
------------------------------------------------------------
   0  |      00b      |     338     |   676
 255  |      01b      |     300     |   600
  31  |      10b      |     280     |   560
  77  |     1100b     |      24     |    96
 115  |     1101b     |      21     |    84
  83  |     1110b     |      20     |    80
 222  |     1111b     |       5     |    20

Results: original file size: (338+300+280+24+21+20+5)*8=7904 bits (=988 bytes), versus 676+600+560+96+84+80+20=2116 bits, i.e. 2116/8=264.5, about 265 bytes.
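Continuing the sketch begun above, the codes of this table can be recovered by walking the tree, with the bit convention stated earlier (right pointer = '0', left pointer = '1'); again, this is my own illustration, not an excerpt of codhuff.c:

#include <stdio.h>

/* Print the variable-length code of every real byte of the tree.
   buf accumulates the bits on the way down; call with depth=0. */
void print_codes(const node *t,char buf[],int depth)
{ if (t->byte!=256)                /* A leaf, i.e. a real byte */
     { buf[depth]='\0';
       printf("Byte %3d -> %sb\n",t->byte,buf);
       return;
     }
  buf[depth]='0';                  /* Right pointer: add a 0 */
  print_codes(t->right,buf,depth+1);
  buf[depth]='1';                  /* Left pointer: add a 1 */
  print_codes(t->left,buf,depth+1);
}

Called on the root built from the table of occurrences above (with a buffer of at least 257 characters, given the 256-bit worst case discussed below), it prints the seven codes 00b, 01b, 10b, 1100b, 1101b, 1110b and 1111b in the order of the table.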
Now you know how to code an input stream. The remaining problem is to decode all this stuff. Actually, when you meet a binary sequence, you can't say which byte list it comes from. Furthermore, if you change the occurrences of one or two bytes, you won't obtain the same resulting binary tree. Try, for example, to encode the previous list with the following occurrences:

Listed bytes | Frequencies (weight)
----------------------------------
    255      |   418
      0      |   300
     31      |   100
     77      |    24
    115      |    21
     83      |    20
    222      |     5

As you can observe, the resulting binary tree is quite different, even though we had the same initial bytes. To avoid being in such a situation, we put a header in front of the data. I can't comment on this header at length, but I can say that I minimized it as much as I could. The header is divided into two parts. The first part looks closely like a boolean table (coded more or less in binary to save space), and the second part provides the decoder with the binary code associated with each byte encountered in the original input stream. Here is a summary of the header:

First part
----------
First bit?
- 1: 256 bits, each set to 0 or 1 depending on whether the corresponding
     byte was present in the file to compress (used when the number n of
     distinct bytes is greater than 32 => n bits are set to 1)
- 0: 5 bits giving the number n (minus 1) of distinct bytes encountered
     in the file to compress, followed by the n 8-bit values themselves
     (used when n<=32)

Second part
-----------
(n+1) entries, one for each of the n encountered bytes plus one for a
fictive byte whose code stops the decoding (this fictive byte is the
one set to 256 in the Huffman tree of the encoder). Each entry is:
  First bit?
  - 1: 8 bits giving the length (-1) of the following binary code
       (used when the code length is greater than 32), then the code itself
  - 0: 5 bits giving the length (-1) of the following binary code
       (used when the code length is at most 32), then the code itself

After the header come the binary encoding of the source file and,
finally, the code of the end of encoding (the fictive byte).

With my codecs I can handle binary codes with lengths of up to 256 bits. This corresponds to encoding input streams from one byte up to a practically infinite length. In fact, if a byte had a range from 0 to 257 instead of 0 to 255, my codecs would only show a bug with an input stream of at least 370,959,230,771,131,880,927,453,318,055,001,997,489,772,178,180,790,105 bytes!!! Where does this explosive number come from? In fact, to have a severe bug, I must have a completely unbalanced tree:

Tree
 /\
   \
   /\
     \
     /\
      ...
       \
       /\

Let's take the following example:

Listed bytes | Frequencies (weight)
----------------------------------
     32      |    5
    101      |    3
     97      |    2
    100      |    1
    115      |    1

This produces the following unbalanced tree:

   Tree
    /\
(32,5)\
      /\
(101,3)\
       /\
 (97,2) \
        /\
 (115,1)(100,1)

Let's speak about a mathematical series: the Fibonacci series. It is defined as follows:

{ Fib(0)=0
{ Fib(1)=1
{ Fib(n)=Fib(n-2)+Fib(n-1)

Fib(0)=0, Fib(1)=1, Fib(2)=1, Fib(3)=2, Fib(4)=3, Fib(5)=5, Fib(6)=8, Fib(7)=13, etc. But 1, 1, 2, 3, 5 are exactly the occurrences of our list! We can actually demonstrate that to obtain an unbalanced tree, we have to take a list whose occurrences are based on the Fibonacci series (these values are minimal). If the data to compress contain m different bytes, then when the tree is completely unbalanced the longest code needs m-1 bits. In our little example, where m=5, the longest codes are associated with the bytes 115 and 100, respectively coded 0001b and 0000b. We can also say that to obtain an unbalanced tree, the input stream must contain at least 5+3+2+1+1=12=Fib(7)-1 bytes.

To conclude: with a coder that uses codes of up to m-1 bits, you must never have an input stream larger than Fib(m+2)-1 bytes, otherwise there can be a bug in the output stream. Of course, with my codecs there will never be such a bug, because I can deal with binary code sizes from 1 to 256 bits. Some encoders could exploit this bound: with m=31, Fib(31+2)-1=3,524,577, and with m=32, Fib(32+2)-1=5,702,886. And an encoder that uses unsigned 32-bit integers shouldn't have a bug until about 4 Gb of input...
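As a quick check of these last figures, a few lines (an illustration written for this text) are enough to print Fib(m+2)-1 for the two limits just mentioned:

#include <stdio.h>

int main(void)
{ unsigned long a=0,b=1;          /* Fib(0) and Fib(1) */
  int i;

  for (i=2;i<=34;i++)
      { unsigned long t=a+b;      /* Fib(i)=Fib(i-2)+Fib(i-1) */

        a=b;
        b=t;
        if (i==33 || i==34)       /* The cases m=31 and m=32 */
           printf("m=%d: Fib(%d)-1=%lu bytes\n",i-2,i,b-1);
      }
  return 0;
}

It prints 3524577 for m=31 and 5702886 for m=32, matching the figures above.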
+==========================================================+
|                     The LZW encoding                     |
+==========================================================+

The LZW scheme is due to three researchers: Abraham Lempel and Jacob Ziv worked on it in 1977, and Terry Welch finalized the scheme in 1984. LZW is patented in the USA. The patent, number 4,558,302, is held by the Unisys Corporation. You can usually write (without fees) software codecs that use the LZW scheme, but hardware companies can't do so. You may get a limited licence by writing to:

Welch Licensing Department
Office of the General Counsel
M/S C1SW19
Unisys Corporation
Blue Bell
Pennsylvania, 19424 (USA)

If you're a Westerner, you surely use something like an LZW encoding every time you speak, especially when you use a dictionary. Let's consider, for example, the word "Cirrus". As you read a dictionary, you begin with "A", "Aa", and so on. But a computer has no experience, and it must assume that some words already exist. That is why, with "Cirrus", it assumes that "C", "Ci", "Cir", "Cirr", "Cirru", and "Cirrus" exist. Of course, since this is a computer, all these words are encoded as index numbers. Every time you move forward, you add a new number associated with the new word.

Since a computer is byte-based and not alphabet-based, you have an initial dictionary of 256 letters instead of our 26 letters ('A' to 'Z').

Example: let's code "XYXYZ". First step: "X" is recognized in the initial dictionary of 256 letters as the 89th (ASCII code 88). Second step: "Y" is read. Does "XY" exist? No, so "XY" is stored as word 256 and you write the code of "X", i.e. 88, to the output stream. Now "YX" is tested: it is not referenced in the current dictionary, so it is stored as word 257 and you write 89 (ASCII code of "Y") to the output stream. "XY" is now met, but this time "XY" is known, with reference 256. Since "XY" exists, you test the sequence with one more letter, i.e. "XYZ". This last word is not referenced in the current dictionary, so you write the value 256 and store "XYZ" as reference 258. Finally, you reach the last letter ("Z"); since it is the last letter, you just write the value 90 (ASCII code of "Z").

Another encoding example, with the string "ABADABCCCABCEABCECCA":

+----+-----+---------------------+------+----------+-------------------------+------+
|Step|Input|  Dictionary test    |Prefix|New symbol|Dictionary               |Output|
|    |     |                     |      |          |D0=ASCII with 256 letters|      |
+----+-----+---------------------+------+----------+-------------------------+------+
| 1  | "A" |"A" in D0            | "A"  |   "B"    | D1=D0                   |  65  |
|    | "B" |"AB" not in D0       |      |          | and "AB"=256            |      |
+----+-----+---------------------+------+----------+-------------------------+------+
| 2  | "A" |"B" in D1            | "B"  |   "A"    | D2=D1                   |  66  |
|    |     |"BA" not in D1       |      |          | and "BA"=257            |      |
+----+-----+---------------------+------+----------+-------------------------+------+
| 3  | "D" |"A" in D2            | "A"  |   "D"    | D3=D2                   |  65  |
|    |     |"AD" not in D2       |      |          | and "AD"=258            |      |
+----+-----+---------------------+------+----------+-------------------------+------+
| 4  | "A" |"D" in D3            | "D"  |   "A"    | D4=D3                   |  68  |
|    |     |"DA" not in D3       |      |          | and "DA"=259            |      |
+----+-----+---------------------+------+----------+-------------------------+------+
| 5  | "B" |"A" in D4            | "AB" |   "C"    | D5=D4                   | 256  |
|    | "C" |"AB" in D4           |      |          | and "ABC"=260           |      |
|    |     |"ABC" not in D4      |      |          |                         |      |
+----+-----+---------------------+------+----------+-------------------------+------+
| 6  | "C" |"C" in D5            | "C"  |   "C"    | D6=D5                   |  67  |
|    |     |"CC" not in D5       |      |          | and "CC"=261            |      |
+----+-----+---------------------+------+----------+-------------------------+------+
| 7  | "C" |"C" in D6            | "CC" |   "A"    | D7=D6                   | 261  |
|    | "A" |"CC" in D6           |      |          | and "CCA"=262           |      |
|    |     |"CCA" not in D6      |      |          |                         |      |
+----+-----+---------------------+------+----------+-------------------------+------+
| 8  | "B" |"A" in D7            | "ABC"|   "E"    | D8=D7                   | 260  |
|    | "C" |"AB" in D7           |      |          | and "ABCE"=263          |      |
|    | "E" |"ABC" in D7          |      |          |                         |      |
|    |     |"ABCE" not in D7     |      |          |                         |      |
+----+-----+---------------------+------+----------+-------------------------+------+
| 9  | "A" |"E" in D8            | "E"  |   "A"    | D9=D8                   |  69  |
|    |     |"EA" not in D8       |      |          | and "EA"=264            |      |
+----+-----+---------------------+------+----------+-------------------------+------+
| 10 | "B" |"A" in D9            |"ABCE"|   "C"    | D10=D9                  | 263  |
|    | "C" |"AB" in D9           |      |          | and "ABCEC"=265         |      |
|    | "E" |"ABC" in D9          |      |          |                         |      |
|    | "C" |"ABCE" in D9         |      |          |                         |      |
|    |     |"ABCEC" not in D9    |      |          |                         |      |
+----+-----+---------------------+------+----------+-------------------------+------+
| 11 | "C" |"C" in D10           | "CCA"|          |                         | 262  |
|    | "A" |"CC" in D10          |      |          |                         |      |
|    |     |"CCA" in D10         |      |          |                         |      |
|    |     |(end of data)        |      |          |                         |      |
+----+-----+---------------------+------+----------+-------------------------+------+
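Here is a coder sketch matching the walk-through above (written for this text, it is *not* an excerpt of a real LZW codec; new words are numbered from 256 exactly as in the table, with no reinitialization or end-of-information codes, and the dictionary search is a naive linear scan):

#include <stdio.h>

#define DICT_SIZE 4096            /* As in GIF/TIFF, see below */

/* A dictionary word = a previous word (given by its code) plus one
   byte. The codes 0 to 255 are the initial one-letter words. */
static int prefix[DICT_SIZE];
static unsigned char suffix[DICT_SIZE];
static int dict_len;              /* Number of words currently defined */

/* Linear search: return the code of the word (p,c), or -1 */
static int find_word(int p,unsigned char c)
{ int i;

  for (i=256;i<dict_len;i++)
      if (prefix[i]==p && suffix[i]==c)
         return i;
  return -1;
}

/* Encode src[0..len-1]. The codes are printed as plain integers
   here; their 9- to 12-bit packing is discussed right after. */
void lzw_encode(const unsigned char *src,unsigned long len)
{ unsigned long i;
  int w;                          /* Current word (given by its code) */

  if (len==0)
     return;
  dict_len=256;
  w=src[0];
  for (i=1;i<len;i++)
      { int x=find_word(w,src[i]);

        if (x>=0)                 /* Word+letter is known: extend it */
           w=x;
        else
           { printf("%d ",w);     /* Emit the longest known word */
             if (dict_len<DICT_SIZE)
                { prefix[dict_len]=w;      /* Learn word+letter */
                  suffix[dict_len]=src[i];
                  dict_len++;
                }
             w=src[i];            /* Restart from the last letter read */
           }
      }
  printf("%d\n",w);               /* Emit the last word */
}

Applied to the string "ABADABCCCABCEABCECCA", this sketch prints 65 66 65 68 256 67 261 260 69 263 262, i.e. exactly the Output column of the table.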
You just say that the encoding starts with 9 bits and, as soon as you reach the 512th word, you use a 10-bit encoding. With 1024 words you use 11 bits, with 2048 words 12 bits, and so on with all the powers of two (2^n, n positive).

To keep the coder and the decoder well synchronized, most implementations use two additional references. The word 256 is a reinitialization code (the codec must completely reinitialize the current dictionary to its 256 initial letters), and the word 257 is an end-of-information code (no more data to read). In that case, your first new word gets the code number 258.

You can also do as in the GIF file format and start with an initial dictionary of 18 words to code an input stream whose letters fit on 4 bits (you then start with codes of 5 bits in the output stream!). The 18 initial words are: 0 to 15 (the initial letters), 16 (reinitialize the dictionary), and 17 (end of information). The first new word gets code 18, the second one code 19, ...

Important: you can consider that your dictionary is limited to 4096 different words (as in the GIF and TIFF file formats). But when your dictionary is full, you can decide to keep sending codes *without* reinitializing the dictionary, and all decoders must be compliant with this. This enables you to decide that reinitializing the full dictionary is not efficient: instead of changing the dictionary, you send/receive (depending on whether you are the coder or the decoder) codes that already exist in the full dictionary. My codecs can deal with most initial dictionary sizes as well as with the full-dictionary behaviour.

Let's see how to decode an LZW encoding. We saw that the true dynamic Huffman scheme needs a header in front of the encoded data; no header at all is needed in the LZW scheme. When two successive codes are read, the first one must exist in the dictionary. It can be immediately decoded and written to the output stream. If the second code is less than or equal to the highest reference in the current dictionary, it is decoded in the same way as the first one. Conversely, if the second code is equal to the highest reference plus one, it means you have to write the word (the sentence, not the code number) of the previous code followed by the first character of that same word. In both cases a new word appears in the dictionary: it is composed of all the letters of the word associated with the first code plus the first letter of the word associated with the second code. You then continue the process with the second and third codes read from the input stream (of codes)...
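These rules translate into the following decoder sketch (again written for this text with my own names; each word is stored as an explicit string for clarity, where a real codec would rather store prefix-code/suffix pairs as in the coder sketch; valid input is assumed):

#include <stdio.h>
#include <string.h>

#define DICT_SIZE 4096
#define MAX_WORD  64              /* Large enough for this example */

static char words[DICT_SIZE][MAX_WORD];
static int dict_len;

/* Decode the integer codes produced by the coder sketch above
   (still no reinitialization or end-of-information codes here) */
void lzw_decode(const int *code,int ncodes)
{ int i;

  if (ncodes==0)
     return;
  dict_len=256;
  for (i=0;i<256;i++)             /* The 256 initial one-letter words */
      { words[i][0]=(char)i;
        words[i][1]='\0';
      }
  printf("%s",words[code[0]]);
  for (i=1;i<ncodes;i++)
      { char new_word[MAX_WORD];

        /* New word = word of the previous code + first letter of
           the word of the current code (the rule stated above) */
        strcpy(new_word,words[code[i-1]]);
        if (code[i]<dict_len)               /* First case: known code */
           { strncat(new_word,words[code[i]],1);
             printf("%s",words[code[i]]);
           }
        else                                /* Second case: code = highest
                                               reference plus one */
           { strncat(new_word,words[code[i-1]],1);
             printf("%s",new_word);
           }
        if (dict_len<DICT_SIZE)
           strcpy(words[dict_len++],new_word);
      }
  printf("\n");
}

Fed with the codes 65, 66, 65, 68, 256, 67, 261, 260, 69, 263, 262, it prints "ABADABCCCABCEABCECCA" again, and its dictionary grows exactly as in the table below.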
Example: let's decode the encoding produced a bit earlier.

+------+-------+----------------+----------+-------------------+--------+
| Step | Input | Code to decode | New code | Dictionary        | Output |
+------+-------+----------------+----------+-------------------+--------+
|  1   |  65   |      65        |    66    | 65,66=256         | "A"    |
|      |  66   |                |          |                   |        |
+------+-------+----------------+----------+-------------------+--------+
|  2   |  65   |      66        |    65    | 66,65=257         | "B"    |
+------+-------+----------------+----------+-------------------+--------+
|  3   |  68   |      65        |    68    | 65,68=258         | "A"    |
+------+-------+----------------+----------+-------------------+--------+
|  4   |  256  |      68        |   256    | 68,65=259         | "D"    |
+------+-------+----------------+----------+-------------------+--------+
|  5   |  67   |     256        |    67    | 65,66,67=260      | "AB"   |
+------+-------+----------------+----------+-------------------+--------+
|  6   |  261  |      67        |   261    | 67,67=261         | "C"    |
+------+-------+----------------+----------+-------------------+--------+
|  7   |  260  |     261        |   260    | 67,67,65=262      | "CC"   |
+------+-------+----------------+----------+-------------------+--------+
|  8   |  69   |     260        |    69    | 65,66,67,69=263   | "ABC"  |
+------+-------+----------------+----------+-------------------+--------+
|  9   |  263  |      69        |   263    | 69,65=264         | "E"    |
+------+-------+----------------+----------+-------------------+--------+
|  10  |  262  |     263        |   262    | 65,66,67,69,67=265| "ABCE" |
+------+-------+----------------+----------+-------------------+--------+
|  11  |       |     262        |          |                   | "CCA"  |
+------+-------+----------------+----------+-------------------+--------+

Summary: step 4 is an explicit example. The code to decode is 68 ("D" in ASCII) and the new code is 256. The new word to add to the dictionary is composed of the letters of the word of the first code plus the first letter of the word of the second code (code 256, word "AB"), i.e. 68 ("D") plus 65 ("A"). So the new word has the letters 68 and 65 ("DA").

Step 6 is quite special. The first code to decode is referenced, but the second code is not yet referenced, since the dictionary is currently limited to 260 referenced words. We have to apply the second case given previously: you take the word to decode plus its first letter, i.e. "C"+"C"="CC".

Be careful: if any encountered code is *greater* than the dictionary size plus one, it means you have a problem in your data and/or your codecs are... bad!

Tricks to improve LZW encoding (but they make it a non-standard encoding):
- Limit the dictionary to a high number of words (a 4096-word maximum enables you to encode a stream of at most 7,370,880 letters with the same dictionary)
- Use an initial dictionary of fewer than 258 words when possible (for example, with 16-color pictures, you can start with a dictionary of 18 words)
- Don't reinitialize the dictionary when it is full
- Reinitialize the dictionary with the most frequent words of the previous dictionary
- Use the codes from (current dictionary size+1) to (maximum dictionary size), because these codes are unused in the standard LZW scheme. Such a compression scheme has been used (successfully) by Robin Watts.
+==========================================================+
|                         Summary                          |
+==========================================================+

-------------------------------------------------
RLE type 1: Fastest compression. Good ratio for general purposes.
            Doesn't need to read the data twice. Fast decoding.
-------------------------------------------------
RLE type 2: Fast compression. Very good ratio in general (even for
            general purposes). Needs to read the data twice.
            Fast decoding.
-------------------------------------------------
RLE type 3: Slowest compression. Good ratio on image files, quite
            average for general purposes. Needs to read the data
            twice. Change the line:
                #define MAX_RASTER_SIZE 256
            into:
                #define MAX_RASTER_SIZE 16
            to speed up the encoding (but the compression ratio
            decreases). If you compress with memory buffers, do not
            modify this line... Fast decoding.
-------------------------------------------------
RLE type 4: Slow compression. Good ratio on image files, average for
            general purposes. Change the line:
                #define MAX_RASTER_SIZE 66
            into:
                #define MAX_RASTER_SIZE 16
            to speed up the encoding (but the compression ratio
            decreases). If you compress with memory buffers, do not
            modify this line... Fast decoding.
-------------------------------------------------
Huffman:    Fast compression. Good ratio on text files and the like,
            average for general purposes. An interesting method for
            compressing a buffer already compressed by the RLE type 1
            or 2 methods... Fast decoding.
-------------------------------------------------
LZW:        Quite fast compression. Good, even very good, ratio for
            general purposes. The bigger the data, the better the
            compression ratio. Quite fast decoding.
-------------------------------------------------

The source codes work on all kinds of computers with a C compiler. With the compiler, optimize for execution speed rather than for space. On a UNIX system, it's better to compile them with the -O option. If you don't use a GNU compiler, the source file MUST NOT be over 4 Gb in size for RLE 2, RLE 3, and Huffman, because I count the occurrences of the bytes in an 'unsigned long int'. With GNU compilers, 'unsigned long int' is 8 bytes instead of 4 bytes (the size used by normal UNIX C compilers and PC compilers such as Microsoft C++ and Borland C++). Actually:

* Normal UNIX compilers, and Microsoft C++
  and Borland C++ for PCs                  => 4 Gb           (unsigned long int = 4 bytes)
* GNU UNIX compilers                       => 17179869184 Gb (unsigned long int = 8 bytes)

+==========================================================+
|                           END                            |
+==========================================================+