Attention
The Attention Mechanism
Attention is the core idea of the Transformer. It lets every token look at all previous tokens and decide how much each one should contribute to its current representation.
Queries, Keys, Values
Each token is projected into three vectors:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What do I send if selected?"
The output is a weighted sum of values, where each weight is derived from the dot product between Q and the corresponding K (normalized into a probability distribution by a softmax, as the next section shows).
Scaled Dot-Product Attention
scores[t] = dot(Q, K[t]) / sqrt(head_dim)
weights = softmax(scores)
output[j] = sum(weights[t] * V[t][j] for t)
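To make the three formulas concrete, here is a tiny worked example with head_dim = 2 and two cached tokens (the specific numbers are arbitrary, chosen only for illustration):

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

head_dim = 2
Q = [1.0, 0.0]                       # current token's query
K = [[1.0, 0.0], [0.0, 1.0]]         # keys of the two cached tokens
V = [[1.0, 2.0], [3.0, 4.0]]         # their values

scores = [sum(Q[j] * K[t][j] for j in range(head_dim)) / math.sqrt(head_dim)
          for t in range(len(K))]    # [0.707..., 0.0]
weights = softmax(scores)            # ~ [0.670, 0.330]: token 0 matches Q better
output = [sum(weights[t] * V[t][j] for t in range(len(V)))
          for j in range(head_dim)]  # ~ [1.660, 2.660]
```

Because Q points in the same direction as K[0], the first token gets the larger weight, and the output lands closer to V[0] than to V[1].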
We divide by sqrt(head_dim) because the dot product of two vectors with unit-variance components has variance head_dim, so raw scores grow with dimension. Without the scaling, the softmax saturates: the weights collapse toward a one-hot distribution (near-0 or near-1), and gradients through the softmax become vanishingly small.
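A quick sketch of the saturation effect, using random vectors with unit-variance components (the dimensions and key count here are arbitrary). Dividing the scores by sqrt(head_dim) > 1 acts like a softmax temperature greater than 1, so the scaled distribution is strictly flatter:

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

random.seed(0)
head_dim = 64
q = [random.gauss(0, 1) for _ in range(head_dim)]
ks = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(8)]

# Raw dot products have standard deviation ~ sqrt(head_dim) = 8,
# so the largest score dominates and the softmax is nearly one-hot.
raw = [sum(qj * kj for qj, kj in zip(q, k)) for k in ks]
scaled = [s / math.sqrt(head_dim) for s in raw]

print(max(softmax(raw)))     # close to 1.0: saturated
print(max(softmax(scaled)))  # noticeably flatter distribution
```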
In MicroGPT
At each new token position, Q is the current token's query. K and V accumulate over all previously seen tokens. This KV-cache enables efficient autoregressive generation — we never recompute old keys and values.
attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
               for t in range(len(k_h))]
attn_weights = softmax(attn_logits)
head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
            for j in range(head_dim)]
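To see the KV-cache in action, here is a minimal, self-contained sketch of how the cache grows by one entry per generated token. The "projections" are placeholders (in a real model, q_h, k_h, v_h come from multiplying the token's hidden state by learned weight matrices):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

head_dim = 4
k_cache, v_cache = [], []            # grow by one entry per token; never recomputed

# Made-up per-token vectors standing in for projected hidden states.
tokens = [[0.1 * (t + j) for j in range(head_dim)] for t in range(3)]

for x in tokens:
    q_h, k_h, v_h = x, x, x          # placeholder identity "projections"
    k_cache.append(k_h)
    v_cache.append(v_h)
    attn_logits = [sum(q_h[j] * k[j] for j in range(head_dim)) / head_dim**0.5
                   for k in k_cache]
    attn_weights = softmax(attn_logits)
    head_out = [sum(attn_weights[t] * v_cache[t][j] for t in range(len(v_cache)))
                for j in range(head_dim)]
    print(len(k_cache), [round(w, 3) for w in attn_weights])
```

At step t, the new query attends over t+1 cached keys; everything computed at earlier steps is reused as-is, which is what makes autoregressive generation cheap.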
Your Task
Implement single_head_attention(q, keys, values, head_dim) that returns the attended output vector.