I Compressed Moby Dick with Huffman Trees

The other day I stumbled across Huffman coding and Huffman trees accidentally, I had learned about them within my CS curriculum but never really gave them a second thought. But turns out they're super super cool and interesting. So I implemented it on my own and compressed down Herman Melville's Moby Dick.

Ideally Text compression unlike image or video needs to be lossless because we cannot afford to miss characters as that might change intended meaning of the words or simply just not make any sense.

I really don't want to get into the history of how this came to be, so I'll just define what Huffman coding is, what we do to achieve it and show you my implementation.

Huffman Coding

To keep things simple, we usually represent all characters we use in the English language with 8 bit ASCII codes where each character is mapped to a unique character sequence. On the other hand with Huffman coding we can assign shorter binary sequences to characters according to their frequency. We assign shorter codes to more frequently occurring characters and longer ones to less frequent characters, some characters may have a mapping longer than 8 bits, but it on average tends to balance out due to the more frequently occurring characters being assigned shorter codes.

Huffman Tree

We build a tree structure, often called a Huffman tree to encode and decode our binary mapping for every character. Each leaf node represents a character, and each interior node contains the sum of the frequency of its children. Once our tree is ready we can traverse it to encode and decode. When we pick the left subtree we add a 0 to your character mapping, 1 otherwise. We can use this to formulate a unique binary mapping to a character.

How to build a Huffman tree

Step by step we follow these steps.

Count frequencies
Create leaf nodes corresponding to each character
Build a min heap with each item as node ordered by frequency
To build the actual tree
- Remove two nodes with lowest frequencies
- Create a new node with the sum of frequencies as the parent, left child as smaller node and right child as the larger node
- push new node back into heap
Repeat till there is a complete tree with all characters included in the tree.

Process and Pseudo Code Snippets

To keep the process simple, the conversions to binary were stored in text files, the original book in binary came out to be 9720 kB.

1. Character Frequency Count

We read all the characters in the source file and then count how many times each one occurs and create a frequency map for all characters.

This is so we can determine at what depth of the Huffman tree does a character go.

It was done as simply as

def char_freq():
	file = read('moby-dick.txt')
	freqs = {}
	for c in file:
		key = "<SPACE>" if c == " " else ("\\n" if c == "\n" else c)
		freqs[key] = freqs.get(key, 0) + 1
	return freqs

2. Define a Node structure to hold a character

We define a node data structure that can hold values like character, label, frequency, left sub-tree and right sub-tree, we also overload the function of the < character.

The code looks like following

class Node:
    def __init__(self, char, freq, left, right):
        self.char = char
        self.freq = freq
        self.left = left
        self.right = right

    def __lt__(self, other):
        return self.freq < other.freq

the __lt__ helps the heap we will use to implicitly handle how we determine which node is larger

3. Building the Huffman Tree

We build a heap with all characters as their individual nodes, we take two of the smallest nodes create a parent node out of the sum of frequencies of those nodes, the left child is the smaller node and right child is the larger node, we then push it back into the heap. we repeat this process till there is only one node (i.e. a tree containing all the character nodes) left in the heap.

The process looks something like

import heapq

def build_huffman_tree():
	heap = [Node(c, f) for c, f in freqs.items()]
	heapq.heapify(heap)

	while len(heap) > 1:
		left = heap.heappop()
		right = heap.heappop()
		merged = Node("Internal", left.freq + right.freq, left, right)
		heap.push(merged)

	root = heap[0]
	return root

The tree will look like this when represented using graphviz

4. Generate Huffman Code mappings for all Characters

We perform a Root to Leaf traversal of each unique path on the tree and generate a unique Huffman code mapping for each of the characters in the piece of text by adding a 0 to the code if we traverse left, and adding a 1 if we traverse right.

The code would look like the following.

def generate_huffman_codes(root, prefix="", code_map={}):
	if root is not None:
		if root.char != "Internal":
			code_map[root.char] = prefix
		generate_huffman_codes(root.left, prefix + "0", code_map)
		generate_huffman_codes(root.right, prefix + "1", code_map)
		
	return code_map

5. Encode all characters to binary to compress

Once we have all the binary mappings for each character, We can encode characters from the original text to a new file with the binary codes for each character.

The code to do this is as simple as

def compress_text(code_map):
	file = read('moby-dick.txt')
	bitstring = ""
	for c in file:
		# this temp is just so all of it fits in one line
		temp = code_map["<SPACE>" if c == " " else ("\\n" if c == "\n" else c)]
		bitstring += temp
	file.write('moby-dick-compressed.txt', bitstring)

6. Decode all characters back to original string

Before we call it a day, we need to decode the moby-dick-compressed.txt back to alphanumeric characters and make sure that the compression is lossless and accurate, to do this we traverse through the string and find the correct mapping for each character sequence by traversing the tree.

To be precise we traverse the string, for every character we move down on either the left or right subtree and continue till we hit a leaf node, once we hit a leaf node we know we have our next character.

The code for this would look along the lines of

def decode_huffman_compression(compressed_file, huffman_tree):
	file = read(compressed_file)
	decoded_chars = []
	node = huffman_tree

	for bit in file:
		node = node.left if bit == "0" else node.right
		if node.left is None and node.right is None:
			c = node.char
			if c == '<SPACE>':
				c = " "
			elif c == "\\n":
				c = "\n"
			decoded_chars.append(c)
			node = huffman_tree

	file2.create("moby-dick-reconstructed.txt")
	file2.write("".join(decoded_chars))

Once reconstructed the file was identical to the original moby-dick.txt, i.e. the compression was lossless.

Wrapping up

Finally the size of the compressed binary file came out to be 5473 kB, which is a 43.7% reduction in size. Really loved reading up on these and implementing this on a book as large as Moby Dick. Was really really fun to use the data structures I've only seen on Leetcode in a more practical scenario.

pss pss you should soubscribe to my blog's mailing list. click here to do so.

PS click here if you wanna read other blogs I've written

All Blogs

Also if you wanna know more about Huffman Coding in the context of Information Theory, do check this out, was a very fun watch for me.

Huffman Codes: An Information Theory Perspective