Positional Encoding
Transformers process all tokens in parallel — they have no built-in notion of order. Without positional information, "the cat sat on the mat" and "the mat sat on the cat" would produce identical representations.
Why Position Matters
Self-attention computes a weighted sum over all tokens, treating them as a set. The model cannot tell which token came first, second, or last. Positional encodings inject order information into the token embeddings.
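This permutation-equivariance can be seen directly in a toy example. The sketch below is a minimal single-head self-attention with identity Q/K/V projections (an assumption for brevity, not the full Transformer layer): reversing the input tokens simply reverses the outputs, so without positional information no ordering is distinguishable.

```python
import math

# Toy self-attention with identity Q/K/V projections: each output is a
# softmax-weighted sum of all token vectors, i.e. a set operation.
def self_attention(tokens):
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) for k in tokens]
        peak = max(scores)                       # subtract max for stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, tokens))
            for j in range(len(q))
        ])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
forward = self_attention(tokens)
backward = self_attention(tokens[::-1])
# Permuting the inputs merely permutes the outputs: the reversed run
# produces the same vectors in reversed order.
assert all(
    math.isclose(a, b)
    for row_f, row_r in zip(forward, backward[::-1])
    for a, b in zip(row_f, row_r)
)
```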
Sinusoidal Positional Encoding
The original Transformer paper ("Attention Is All You Need") uses fixed sinusoidal functions:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i/d_model))
Each sin/cos dimension pair oscillates at a different frequency: low-index dimensions change rapidly across positions, while high-index dimensions change slowly. Together this creates a unique "fingerprint" for every position.
Key Properties
- Deterministic: No learned parameters — computed once and added to embeddings
- Unique per position: Every position gets a distinct vector
- Bounded: Values stay in [-1, 1], so they don't dominate the embeddings
- Relative positioning: The encoding of position pos + k can be expressed as a linear function of the encoding at pos
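The relative-positioning property follows from the angle-addition identities: within each sin/cos pair, moving from pos to pos + k is a 2x2 rotation by k times that pair's frequency, independent of pos. A small numerical check (dimension sizes here are arbitrary choices for illustration):

```python
import math

d_model, pos, k = 8, 5, 3

def pair(p, i):
    # (sin, cos) pair for dimension pair i at position p
    freq = 1.0 / (10000 ** (2 * i / d_model))
    return math.sin(p * freq), math.cos(p * freq)

for i in range(d_model // 2):
    s, c = pair(pos, i)
    freq = 1.0 / (10000 ** (2 * i / d_model))
    # Rotating (sin, cos) by angle k*freq gives the pair at pos + k:
    #   sin(w(p+k)) = sin(wp)cos(wk) + cos(wp)sin(wk)
    #   cos(w(p+k)) = cos(wp)cos(wk) - sin(wp)sin(wk)
    s_shift = s * math.cos(k * freq) + c * math.sin(k * freq)
    c_shift = c * math.cos(k * freq) - s * math.sin(k * freq)
    s_true, c_true = pair(pos + k, i)
    assert math.isclose(s_shift, s_true) and math.isclose(c_shift, c_true)
```

Because the rotation depends only on the offset k, a model can in principle learn to attend to relative positions with a fixed linear transform.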
Implementation
import math

def positional_encoding(seq_len, d_model):
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Paired dimensions (2i, 2i+1) share one frequency:
            # even indices take sin, odd indices take cos.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            if i % 2 == 0:
                row.append(math.sin(angle))
            else:
                row.append(math.cos(angle))
        pe.append(row)
    return pe
Position 0 always produces [sin(0), cos(0), ...] = [0, 1, ...].
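A quick sanity check of the properties listed earlier (the function is restated here only so the snippet runs on its own):

```python
import math

def positional_encoding(seq_len, d_model):
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(seq_len=4, d_model=6)
assert len(pe) == 4 and all(len(row) == 6 for row in pe)
assert pe[0] == [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]        # sin(0)=0, cos(0)=1
assert all(-1.0 <= v <= 1.0 for row in pe for v in row)  # bounded in [-1, 1]
```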
Your Task
Implement positional_encoding(seq_len, d_model) that returns a list of seq_len vectors, each of length d_model.