Positional Encoding
Transformers process all tokens in parallel — they have no built-in notion of order. Without positional information, "the cat sat on the mat" and "the mat sat on the cat" would produce identical representations.
Why Position Matters
Self-attention computes a weighted sum over all tokens, treating them as a set. The model cannot tell which token came first, second, or last. Positional encodings inject order information into the token embeddings.
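This permutation-equivariance can be seen directly in a toy example. The sketch below is a minimal single-head self-attention with identity Q/K/V projections (an assumption for brevity, not the full Transformer layer): reversing the input tokens simply reverses the outputs, so without positional information no ordering is distinguishable.

```python
import math

# Toy self-attention with identity Q/K/V projections: each output is a
# softmax-weighted sum of all token vectors, i.e. a set operation.
def self_attention(tokens):
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) for k in tokens]
        peak = max(scores)                       # subtract max for stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, tokens))
            for j in range(len(q))
        ])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
forward = self_attention(tokens)
backward = self_attention(tokens[::-1])
# Permuting the inputs merely permutes the outputs: the reversed run
# produces the same vectors in reversed order.
assert all(
    math.isclose(a, b)
    for row_f, row_r in zip(forward, backward[::-1])
    for a, b in zip(row_f, row_r)
)
```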
Sinusoidal Positional Encoding
The original Transformer paper ("Attention Is All You Need") uses fixed sinusoidal functions:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i/d_model))
Each sin/cos dimension pair oscillates at a different frequency: low-index dimensions change rapidly across positions, while high-index dimensions change slowly. Together this creates a unique "fingerprint" for every position.
Key Properties
- Deterministic: No learned parameters — computed once and added to embeddings
- Unique per position: Every position gets a distinct vector
- Bounded: Values stay in [-1, 1], so they don't dominate the embeddings
- Relative positioning: The encoding of position pos + k can be expressed as a linear function of the encoding at pos
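The relative-positioning property follows from the angle-addition identities: within each sin/cos pair, moving from pos to pos + k is a 2x2 rotation by k times that pair's frequency, independent of pos. A small numerical check (dimension sizes here are arbitrary choices for illustration):

```python
import math

d_model, pos, k = 8, 5, 3

def pair(p, i):
    # (sin, cos) pair for dimension pair i at position p
    freq = 1.0 / (10000 ** (2 * i / d_model))
    return math.sin(p * freq), math.cos(p * freq)

for i in range(d_model // 2):
    s, c = pair(pos, i)
    freq = 1.0 / (10000 ** (2 * i / d_model))
    # Rotating (sin, cos) by angle k*freq gives the pair at pos + k:
    #   sin(w(p+k)) = sin(wp)cos(wk) + cos(wp)sin(wk)
    #   cos(w(p+k)) = cos(wp)cos(wk) - sin(wp)sin(wk)
    s_shift = s * math.cos(k * freq) + c * math.sin(k * freq)
    c_shift = c * math.cos(k * freq) - s * math.sin(k * freq)
    s_true, c_true = pair(pos + k, i)
    assert math.isclose(s_shift, s_true) and math.isclose(c_shift, c_true)
```

Because the rotation depends only on the offset k, a model can in principle learn to attend to relative positions with a fixed linear transform.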
Implementation
import math

def positional_encoding(seq_len, d_model):
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Paired dimensions (2i, 2i+1) share one frequency:
            # even indices take sin, odd indices take cos.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            if i % 2 == 0:
                row.append(math.sin(angle))
            else:
                row.append(math.cos(angle))
        pe.append(row)
    return pe
Position 0 always produces [sin(0), cos(0), ...] = [0, 1, ...].
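A quick sanity check of the properties listed earlier (the function is restated here only so the snippet runs on its own):

```python
import math

def positional_encoding(seq_len, d_model):
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(seq_len=4, d_model=6)
assert len(pe) == 4 and all(len(row) == 6 for row in pe)
assert pe[0] == [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]        # sin(0)=0, cos(0)=1
assert all(-1.0 <= v <= 1.0 for row in pe for v in row)  # bounded in [-1, 1]
```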
Your Task
Implement positional_encoding(seq_len, d_model) that returns a list of seq_len vectors, each of length d_model.