Build Huffman Compression in Ruby
January 8th, 2008 by proj
Many wonderful things can be done with binary trees. One brilliant usage of the binary tree was proposed by David Huffman in 1951 at MIT which has since become the foundation for much of the compression technology available today. Huffman discovered a simple way to generate a provable minimal binary encoding given a set of input symbols. It’s a deep subject, and one which I’d eventually like to spend more time getting into. For now I’ll walk through how this simple idea can be used to implement some decent compression using pure Ruby.
First of all you have your basic binary tree representation. I previously showed a simple binary tree example that demonstrates how binary trees are easily manipulated in ruby. Every binary tree node holds a value and links off to a right and left nodes that may also hold values. A leaf node is a node with dead links (left and right are both set to nil).
In the Huffman tree the leaves contain a single symbol while interior nodes summarize their sub-tree. I have simplified the idea so that all nodes wrap what is called an association list (assoc list). This is an old lisp idiom of a list containing name value pairs in the form ((name value) (name value)). To find a value in an association list you just walk the list looking at the first element of each sub-list. A reverse association list (rassoc list) takes the same form as an assoc list with the name and value pair swapped. For very small sets of data the performance of assoc lists is better than deeper structures such as a hash or tree. Ruby provides a few array APIs to deal with assoc and rassoc lists. I always store a rassoc list with the (occurance count, symbol) in a leaf and a sum of all contained lists in interior nodes. None of this information is important in understanding how Huffman encoding works but I use this concept below so I thought I should at least call it out.
class Node
# binary tree representation
attr_accessor :value, :left, :right
def initialize(value=nil, left=nil, right=nil)
@value, @left, @right = value, left, right
end
def leaf?
return @left == nil && @right == nil
end
end
By default the node is initialized to be empty with no links. A simple query method ‘leaf?’ is added for some syntactic sugar below when dealing with the tree.
The first step in building the Huffman tree is to build a set of occurrences for the data you want to encode. Some algorithms are able to this dynamically for streams of data but in this case I walk the entire input text first to find all symbols. In this implementation a symbol is a single character. Keep in mind that it is possible to use words or other unique symbols instead of single characters which could result in better compression.
def occurrences(text)
# return an ordered array of [occurrences, character] ascending order
occ = Hash.new(0)
text.scan /./m do |b| occ[b] += 1 end
return occ.keys.map {|k| [occ[k], k]}.sort
end
The hash table occ is initialized with a default value of zero which causes all new entries to start at zero. Every character of the text is scanned and the characters are counted using the hash table. The final stage is to create an array of [count, character] entries in sorted order. Map executes the block over each key looking up the value of the key and outputting a single rassoc element per hash entry.
The final stage in building the tree is to convert the list into a tree representation. The occurrence entries are converted into leaf nodes and put into a queue in ascending order. Two entries are removed from the queue merged onto an interior node and placed onto a second queue. The interior node sums the rassoc list from both nodes that are consumed. If there are no nodes in the primary queue a node will be removed from the secondary queue. The final node remaining is the root of the tree.
def huffman(occlist)
# build a huffman encoding binary tree from the occurrence list
if occlist.empty? then
puts "warning: no occurrences provided to build huffman tree"
return nil
end
# create the initial queue with leaves, trees contain assoc lists
leaves = occlist.map {|entry| Node.new([entry]) }
interior = []
deq = lambda {
if (leaves.length > 0) then leaves.delete_at 0
else interior.delete_at 0 end
}
# create interior nodes
while leaves.length + interior.length > 1
l, r = deq.call, deq.call
node = Node.new(l.value + r.value, l, r)
interior << node
end
deq.call
end
Interior nodes are only formed in the second portion of the Huffman build function. It will only consume leaf nodes or or other interior nodes that represent the edge of the tree. In this way the tree is built bottom up and balanced.
Next up is a bit of a utility function that takes a Huffman tree and a symbol and converts it into an array of 0s and 1s that represent the binary encoding of the symbol. Walking the tree is represented by a 0 or 1 for the left and right sub-trees respectively.
def bits(ht, sym)
# given a tree and symbol return an array of encoded bits
def bits_internal(ht, sym, enc)
return enc, nil if not ht
return enc, ht.value if ht.leaf?
if ht.left.value.rassoc(sym) != nil
enc << 0
bits_internal(ht.left, sym, enc)
else
enc << 1
bits_internal(ht.right, sym, enc)
end
end
if ht.value.rassoc(sym) == nil
puts "warning: no binary encoding for: '#{sym}'"
return []
end
enc, value = bits_internal(ht, sym, [])
if value[0][1] != sym
puts "warning: binary encoding: #{enc.inspect}, value: #{value.inspect} does not match '#{sym}'"
return []
else
return enc
end
end
Bits makes use of the internally stored assoc lists to figure out which sub-tree should be visited. There are more efficient ways to implement this but this method is fine for example purposes. Bits() uses an internal method bitsinternal() which is implemented in a recursive style similar to visit() from the previous binary tree code. Bits does a bit of error checking on the result from bitsinternal() before returning the encoding value back to the caller.
Similar to bits_internal() above decode() will visit the tree using the binary representation in an array and choose the left or right sub-trees. Each time decode calls itself it uses a slice of the original array. If you are familiar with lisp code this method of traversal will look very familiar. When we have arrived at a leaf we know that we have found the symbol and it is returned.
def decode(ht, bits)
if ht.leaf?
return ht.value[0][1]
end
case bits[0]
when 0
decode(ht.left, bits[1..bits.length-1])
when 1
decode(ht.right, bits[1..bits.length-1])
end
end
Now for a simple test of the Huffman encoding procedures above. The standard Lorem Ipsum text will be used. The history of this randomized Latin text is very interesting, it is extracted from an early text on ethics by Cicero which was written in 45 BC.
text = <<TEXT
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Sed est nulla, suscipit vel, tempus sit amet, viverra sit amet, dui. Nunc ultrices, purus vulputate luctus sodales, mauris augue elementum diam, in ornare neque nisi pharetra lectus. In hac habitasse platea dictumst. Phasellus justo turpis, laoreet id, semper at, convallis a, nisi. Duis iaculis erat et mauris. Donec a arcu. Ut sed risus vel mi mollis vehicula. Aenean laoreet, lorem dapibus aliquam ultrices, ante velit vestibulum sem, vel molestie arcu elit sit amet nunc. Vivamus venenatis placerat dui. Mauris porttitor varius velit.
TEXT
To measure the effectiveness of the compression I will sum the bits of the raw text assuming each character were to take a modern standard byte (8bits).
# build a huffman encoding tree
ht = huffman(occurances(text))
# measure the compression
bitsum = 0
for c in text.scan /./m
bitsum += bits(ht, c).length
end
orig = text.length
new = bitsum / 8.0
printf("original %s bytes, huffman encoded: %s bytes, ratio: %.2f%%",
orig, new, new/orig*100)
And what do we get for our efforts on this humble piece of ruby code:
original 598 bytes, huffman encoded: 374.25 bytes, ratio: 62.58%
~63% compression ratio. Not too shabby!
Very interesting! There’s been more than a decade since I first implemented Huffman as a college exercise in Pascal. It is good to see it done in Ruby once again. Programmers have to understand the importance of algorithm study. Kudos for bringing this up.
I’m pretty sure your implementation is wrong. It should be 312.625 bytes. The reason why it is wrong is because you don’t always combine the nodes with the two lowest frequencies (the interior Array should be ordered).
Akita, thank you. Keep up the great work on your site!
Nitpicker, I didn’t implement the weight sorting as you point out. I’ll look for an elegant way to approach that tomorrow. Sounds like you already solved that particular issue. Well done.
Very interesting, I’m new to Ruby and frankly it amazes me with its elegance… I ran your code with larger text (128 KB) and it seems to go forever, now I had a similar issue when I wrote a huffman coding/enconding in Lisp before, I had the intention of visiting it again to improve on it but that never happened.
I’m wondering if you know where the bottle neck is here in your code, what could we do to improve the efficiency…I’ll make this an exercise for me to learn more about Ruby, but would appreciate pointers from you also…
Thank you
forgot to mention…. The mainly slow part in my Lisp code is the writing of binary bytes….which I think could be improved if there is some sort of buffering where big chunks are written at once..
Could this be the issue with your code?
Regars