Lesson 13 of 17

Attention

The Attention Mechanism

Attention is the core idea of the Transformer. It lets every token attend to all previous tokens and decide how much each one should contribute to its updated representation.

Queries, Keys, Values

Each token is projected into three vectors:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What do I send if selected?"

The output is a weighted sum of values, where the weight between Q and each K is determined by their dot product.
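As a concrete sketch of the projection step: each of Q, K, and V is just a matrix-vector product applied to the token's embedding. The weight matrices below are hypothetical (randomly initialized for illustration); in a real model they are learned parameters.

```python
import random

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

d_model, head_dim = 4, 4
random.seed(0)
# Hypothetical projection matrices, for illustration only.
W_q = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(head_dim)]
W_k = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(head_dim)]
W_v = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(head_dim)]

x = [1.0, 0.5, -0.3, 0.8]  # one token's embedding
q, k, v = matvec(W_q, x), matvec(W_k, x), matvec(W_v, x)
```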

Scaled Dot-Product Attention

scores[t] = dot(Q, K[t]) / sqrt(head_dim)
weights = softmax(scores)
output[j] = sum(weights[t] * V[t][j] for t)

We divide by sqrt(head_dim) to prevent the dot products from growing too large with the head dimension (which would saturate the softmax, pushing weights to near-0 or near-1 and leaving almost no gradient to learn from).
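A quick numerical sketch of the saturation effect: the raw scores below are made up for illustration, but they show how dividing by sqrt(head_dim) turns a near one-hot softmax into a much softer distribution.

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# With head_dim = 64, dot products of unit-variance vectors have
# variance ~64, so raw scores can easily differ by ~10.
raw = [8.0, -2.0, 3.0]                       # hypothetical unscaled scores
scaled = [s / math.sqrt(64) for s in raw]    # divide by sqrt(head_dim)

print(softmax(raw))     # nearly one-hot: first weight ≈ 0.993
print(softmax(scaled))  # much softer distribution
```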

In MicroGPT

At each new token position, Q is the current token's query. K and V accumulate over all previously seen tokens. This KV-cache enables efficient autoregressive generation — we never recompute old keys and values.

attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
               for t in range(len(k_h))]
attn_weights = softmax(attn_logits)
head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
            for j in range(head_dim)]

Your Task

Implement single_head_attention(q, keys, values, head_dim) that returns the attended output vector.
