
Huffman Code Lengths

Huffman coding is a lossless data compression scheme that assigns shorter codes to more frequent symbols and longer codes to rarer ones.

Optimal Code Length Formula

For a symbol with probability p, the optimal code length (in bits) is:

\ell(p) = \lceil -\log_2 p \rceil

A simplified but instructive greedy approximation uses the floor:

\ell(p) = \max(1, \lfloor -\log_2 p \rfloor)

Taking the maximum with 1 ensures every symbol receives at least one bit, even when its probability exceeds 1/2 (where the floor alone would give 0 bits).
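The ceiling and floor formulas agree whenever p is an exact power of 1/2; otherwise they can differ by one bit. A minimal sketch of the difference, using an arbitrary example probability:

import math

p = 0.3  # arbitrary example probability, not a power of 1/2
print(math.ceil(-math.log2(p)))           # ceiling length from the formula above: 2
print(max(1, math.floor(-math.log2(p))))  # greedy floor approximation: 1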

Average Code Length

Once lengths are assigned, the average code length is:

\bar{\ell} = \sum_i p_i \cdot \ell_i \text{ bits/symbol}

Shannon's Source Coding Theorem

For any uniquely decodable code, the average length can never be less than the entropy: \bar{\ell} \geq H(X). Huffman codes (like Shannon codes) also satisfy the upper bound, so:

H(X) \leq \bar{\ell} < H(X) + 1

The average length of a Huffman code is therefore within 1 bit per symbol of the entropy.
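To check the bound numerically, one can compare the entropy against the average length obtained from the ceiling lengths, which is what the upper bound refers to. A minimal sketch, with an arbitrary example distribution and an entropy helper that is not part of this lesson's required functions:

import math

def entropy(probs):
    # H(X) = -sum_i p_i * log2(p_i)
    return -sum(p * math.log2(p) for p in probs)

probs = [0.4, 0.35, 0.15, 0.1]                        # arbitrary example distribution
lengths = [math.ceil(-math.log2(p)) for p in probs]   # ceiling lengths: [2, 2, 3, 4]
avg = sum(p * l for p, l in zip(probs, lengths))

H = entropy(probs)
print(round(H, 3), round(avg, 3))  # 1.802 2.35
print(H <= avg < H + 1)            # True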

Example

For probabilities [0.5, 0.25, 0.125, 0.125]:

  • Lengths: [\lfloor 1 \rfloor, \lfloor 2 \rfloor, \lfloor 3 \rfloor, \lfloor 3 \rfloor] = [1, 2, 3, 3]
  • Average: 0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 3 = 1.75 bits
  • Entropy: H = 1.75 bits, matched exactly here because every probability is a power of 1/2
import math

def huffman_code_lengths(probs):
    # Greedy approximation: max(1, floor(-log2 p)) bits for each symbol
    return [max(1, math.floor(-math.log2(p))) for p in probs]

def average_code_length(probs, lengths):
    # Expected bits per symbol: sum of p_i * l_i
    return sum(p * l for p, l in zip(probs, lengths))

probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_code_lengths(probs)
print(lengths)                             # [1, 2, 3, 3]
print(average_code_length(probs, lengths)) # 1.75

Your Task

Implement:

  • huffman_code_lengths(probs) — returns a list of code lengths using \max(1, \lfloor -\log_2 p \rfloor)
  • average_code_length(probs, lengths) — returns \sum_i p_i \ell_i