
Huffman Code Lengths

Huffman coding is a lossless data compression scheme that assigns shorter codes to more frequent symbols and longer codes to rarer ones.

Optimal Code Length Formula

For a symbol with probability p, the optimal code length (in bits) is:

\ell(p) = \lceil -\log_2 p \rceil

A simplified but instructive greedy approximation uses the floor:

\ell(p) = \max(1, \lfloor -\log_2 p \rfloor)

Taking the maximum with 1 ensures every symbol receives at least one bit, even when its probability exceeds 1/2 (where the floor alone would give 0 bits).
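The ceiling and floor formulas agree whenever p is an exact power of 1/2; otherwise they can differ by one bit. A minimal sketch of the difference, using an arbitrary example probability:

import math

p = 0.3  # arbitrary example probability, not a power of 1/2
print(math.ceil(-math.log2(p)))           # ceiling length from the formula above: 2
print(max(1, math.floor(-math.log2(p))))  # greedy floor approximation: 1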

Average Code Length

Once lengths are assigned, the average code length is:

\bar{\ell} = \sum_i p_i \cdot \ell_i \text{ bits/symbol}

Shannon's Source Coding Theorem

For any uniquely decodable code, the average length can never be less than the entropy: \bar{\ell} \geq H(X). Huffman codes (like Shannon codes) also satisfy the upper bound, so:

H(X) \leq \bar{\ell} < H(X) + 1

The average length of a Huffman code is therefore within 1 bit per symbol of the entropy.
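To check the bound numerically, one can compare the entropy against the average length obtained from the ceiling lengths, which is what the upper bound refers to. A minimal sketch, with an arbitrary example distribution and an entropy helper that is not part of this lesson's required functions:

import math

def entropy(probs):
    # H(X) = -sum_i p_i * log2(p_i)
    return -sum(p * math.log2(p) for p in probs)

probs = [0.4, 0.35, 0.15, 0.1]                        # arbitrary example distribution
lengths = [math.ceil(-math.log2(p)) for p in probs]   # ceiling lengths: [2, 2, 3, 4]
avg = sum(p * l for p, l in zip(probs, lengths))

H = entropy(probs)
print(round(H, 3), round(avg, 3))  # 1.802 2.35
print(H <= avg < H + 1)            # True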

Example

For probabilities [0.5, 0.25, 0.125, 0.125]:

  • Lengths: [\lfloor 1 \rfloor, \lfloor 2 \rfloor, \lfloor 3 \rfloor, \lfloor 3 \rfloor] = [1, 2, 3, 3]
  • Average: 0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 3 = 1.75 bits
  • Entropy: H = 1.75 bits, matched exactly here because every probability is a power of 1/2
import math

def huffman_code_lengths(probs):
    # Greedy approximation: max(1, floor(-log2 p)) bits for each symbol
    return [max(1, math.floor(-math.log2(p))) for p in probs]

def average_code_length(probs, lengths):
    # Expected bits per symbol: sum of p_i * l_i
    return sum(p * l for p, l in zip(probs, lengths))

probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_code_lengths(probs)
print(lengths)                             # [1, 2, 3, 3]
print(average_code_length(probs, lengths)) # 1.75

Your Task

Implement:

  • huffman_code_lengths(probs) — returns a list of code lengths using \max(1, \lfloor -\log_2 p \rfloor)
  • average_code_length(probs, lengths) — returns \sum_i p_i \ell_i