|
| 1 | +# Huffman-coding-compression |
| 2 | +#### Compression & decompression of text files using [Huffman coding](https://en.wikipedia.org/wiki/Huffman_coding). |
| 3 | + |
| 4 | +## Usage |
| 5 | +To compress or decompress files, pass their paths to the program as arguments. On Windows, you can do this by dragging and dropping the selected files onto the program; Windows then passes their paths to it as arguments.
| 6 | + |
| 7 | +When a file `filename.txt` is passed to the program, it is compressed and the resulting bytes are written to a file `filename.txt.comp`, created in the same directory as `filename.txt`.
| 8 | + |
| 9 | +When a file `filename.txt.comp` is passed to the program, it is decompressed and the text is written to a file `filename.txt`, created in the same directory as `filename.txt.comp`.
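
The naming rule above can be sketched in Python as follows. This is only an illustration of the rule as described; the function name `output_path` is hypothetical and not taken from the program's source.

```python
def output_path(path):
    """Return the path the result would be written to (hypothetical helper).

    A path ending in ".comp" is treated as a compressed file and maps back
    to the original name; any other path gets ".comp" appended.
    """
    suffix = ".comp"
    if path.endswith(suffix):
        # Decompression: filename.txt.comp -> filename.txt
        return path[:-len(suffix)]
    # Compression: filename.txt -> filename.txt.comp
    return path + suffix
```

Because only the suffix changes, the output always lands in the same directory as the input.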
| 10 | + |
| 11 | +## Inner workings |
| 12 | +### Text encoding |
| 13 | +Both compression and decompression use the ISO-8859-2 (Latin-2) encoding.
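
As a small illustration of what this encoding implies, the Python sketch below uses the standard `iso-8859-2` codec. The helper names `to_latin2` and `from_latin2` are assumptions made for this example. Every supported character maps to exactly one byte, and characters outside Latin-2 cannot be encoded at all.

```python
def to_latin2(text):
    """Encode text as ISO-8859-2; each character becomes exactly one byte."""
    # Raises UnicodeEncodeError for characters that Latin-2 does not contain.
    return text.encode("iso-8859-2")


def from_latin2(data):
    """Decode ISO-8859-2 bytes back to text."""
    return data.decode("iso-8859-2")
```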
| 14 | + |
| 15 | +### Compression method |
| 16 | +Each character is assigned an alternative code (instead of its 8-bit Latin-2 one) using [Huffman coding](https://en.wikipedia.org/wiki/Huffman_coding). More frequent characters get shorter codes and less frequent characters get longer ones, which usually results in a smaller total number of bits. This implementation operates on individual characters; it does not encode sections of text (which would give better compression results).
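
The idea can be sketched in Python as follows. This is a minimal illustration of textbook Huffman coding, not this program's actual source. The function name `huffman_codes` is an assumption, and the exact bit patterns depend on how ties between equal frequencies are broken.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Map each character of `text` to a prefix-free bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct character still needs a code of at least one bit.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tiebreaker, {char: code built so far}).
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; prefix their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

For a text such as `"aaaaaaaabbbc"`, the most frequent character `a` ends up with a 1-bit code while `b` and `c` get 2-bit codes, so the 12 characters need 16 bits instead of 96.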
| 17 | + |
| 18 | +This specific implementation compresses files to about 70% of their initial size, saving 30% of space. |
| 19 | +### Storing of the tree |
| 20 | +Information about each character is stored as follows:
| 21 | + |
| 22 | +`[bit indicating if the character was already encountered in the text: 0 if it was, 1 if not]`,
| 23 | + |
| 24 | +if not: `[length of the character's code]`, `[the character's latin-2 representation]`, `[the character's code]` |
| 25 | + |
| 26 | +if yes: `[the character's code]` |
| 27 | + |
| 28 | +- The character's code length is stored in 4 bits, which means that the maximal code length is 2^4 - 1 = 15. This implementation therefore cannot handle texts whose Huffman tree assigns a code longer than 15 bits, which can only happen when the text contains many different characters with very uneven frequencies.
| 29 | + |
| 30 | +- Information about each character is not written at the beginning or the end of the file, but inline, at the character's first occurrence in the text.
| 31 | + |
| 32 | +- The bits are packed into 8-bit bytes. If the total number of bits is not a multiple of 8, padding bits must be added. This implementation adds zeros at the beginning of the compressed data. Because the first piece of information on each character is whether it has (0) or has not (1) already been encountered, the first bit of the first character is always 1. This allows the decompressor to safely recognize and ignore the padding: all zeros before the first 1 bit.
| 33 | + |
| 34 | +- The only information added to a character that was already encountered is one bit (0), which indicates that it has already been encountered. The decompressor then reads the code bit by bit, extending it until it matches an entry in the table where it stores already encountered codes and their Latin-2 representations.
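
The layout described in the list above can be sketched in Python as follows. This is a hedged reconstruction from the description, not the program's actual code. The names `encode_bits` and `pack` are hypothetical, and `codes` is assumed to be a precomputed mapping from characters to Huffman codes given as strings of `0` and `1`.

```python
def encode_bits(text, codes):
    """Build the record stream for `text` as a string of '0'/'1' characters."""
    seen = set()
    records = []
    for ch in text:
        code = codes[ch]
        if ch in seen:
            # Already encountered: flag bit 0, then the code itself.
            records.append("0" + code)
        else:
            if len(code) > 15:
                raise ValueError("code does not fit the 4-bit length field")
            latin2 = ch.encode("iso-8859-2")[0]
            # First occurrence: flag 1, 4-bit length, 8-bit Latin-2 byte, code.
            records.append(
                "1" + format(len(code), "04b") + format(latin2, "08b") + code
            )
            seen.add(ch)
    return "".join(records)


def pack(bits):
    """Left-pad with zeros to a whole number of bytes and convert to bytes."""
    padded = "0" * (-len(bits) % 8) + bits
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
```

Padding at the front works only because the stream always begins with a first-occurrence record, whose flag bit is 1.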
| 35 | + |
| 36 | +example: |
| 37 | +<pre> |
| 38 | +letter | code | code length | Latin-2 code |
| 39 | +  J    | 1010 | 4 (0100)    | 01001010     |
| 40 | +</pre> |
| 41 | + |
| 42 | +- if not already encountered:
| 43 | +`[1][0100][01001010][1010]` (written to the file as one bit sequence: `10100010010101010`)
| 44 | +- if already encountered:
| 45 | +`[0][1010]` (written to the file as one bit sequence: `01010`)
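
For completeness, here is a minimal Python sketch of how a decompressor could read such records back. It follows the format described above and is not taken from the program's source; the names `unpack` and `decode_bits` are hypothetical.

```python
def unpack(data):
    """Turn bytes into a string of '0'/'1' characters."""
    return "".join(format(byte, "08b") for byte in data)


def decode_bits(bits):
    """Decode a record stream, with optional leading zero padding, to text."""
    bits = bits.lstrip("0")  # padding: all zeros before the first 1 bit
    table = {}               # code -> Latin-2 byte, filled on first occurrences
    out = bytearray()
    pos = 0
    while pos < len(bits):
        flag = bits[pos]
        pos += 1
        if flag == "1":
            # First occurrence: 4-bit length, 8-bit Latin-2 byte, then the code.
            length = int(bits[pos:pos + 4], 2)
            byte = int(bits[pos + 4:pos + 12], 2)
            code = bits[pos + 12:pos + 12 + length]
            pos += 12 + length
            table[code] = byte
        else:
            # Already encountered: extend the code bit by bit until it matches.
            code = ""
            while code not in table:
                if pos >= len(bits):
                    raise ValueError("truncated input")
                code += bits[pos]
                pos += 1
            byte = table[code]
        out.append(byte)
    return out.decode("iso-8859-2")
```

Fed the two bit sequences from the example one after the other (`10100010010101010` followed by `01010`), this sketch yields the character with Latin-2 byte `01001010` twice.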