Cross-Entropy Loss
The Loss Function
The model produces a probability for every token in the vocabulary. Cross-entropy loss measures how surprised the model is by the correct next token.
Definition
loss = -log(p_correct)
If the model assigns probability 1.0 to the correct token: loss = -log(1) = 0.
If it assigns probability 0.01: loss = -log(0.01) ≈ 4.6.
The lower the probability the model assigned to the correct token, the larger the penalty.
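The definition above can be checked directly with plain floats (a minimal sketch using Python's standard `math` module, independent of the `Value` class):

```python
import math

def cross_entropy(p_correct):
    # Negative log of the probability assigned to the correct token.
    return -math.log(p_correct)

print(cross_entropy(1.0))   # 0.0: full confidence in the right answer, no penalty
print(cross_entropy(0.01))  # ~4.605: low probability, large penalty
```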
Random Baseline
With a vocabulary of 27 (26 letters + BOS), a random model assigns 1/27 to each token:
baseline loss = -log(1/27) ≈ 3.30
MicroGPT starts around 3.5 (slightly worse than random due to random initialization) and should reach around 2.0 after 1000 training steps.
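The baseline figure is easy to reproduce (a quick check, using only the vocabulary size stated above):

```python
import math

vocab_size = 27  # 26 letters + BOS
baseline = -math.log(1 / vocab_size)  # equivalently, math.log(27)
print(round(baseline, 2))  # 3.3
```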
Implementation
probs = softmax(logits)
loss = -probs[target_id].log()
The Value.log() op records the local gradient d(log(p)) / dp = 1/p; combined with the negation, the overall derivative of the loss with respect to p is d(-log(p)) / dp = -1/p.
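That derivative can be sanity-checked numerically with a central finite difference (a standalone sketch, not using the `Value` class):

```python
import math

p = 0.25
eps = 1e-6

# Central finite difference approximation of d(-log(p)) / dp at p = 0.25.
numeric = (-math.log(p + eps) - (-math.log(p - eps))) / (2 * eps)
analytic = -1 / p

# The two should agree to several decimal places (both close to -4.0).
print(numeric, analytic)
```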
This is one loss per position. We average over all positions in a document:
loss = (1 / n) * sum(losses)
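Averaging per-position losses looks like this with plain floats (the probabilities below are hypothetical, chosen only to illustrate the formula):

```python
import math

# Hypothetical probabilities the model assigned to the correct token
# at each of 4 positions in a document.
probs = [0.5, 0.1, 0.9, 0.25]

# One cross-entropy loss per position, then the mean over all positions.
losses = [-math.log(p) for p in probs]
mean_loss = sum(losses) / len(losses)
print(round(mean_loss, 3))  # 1.122
```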
Your Task
Given logits as Value objects and a target index, compute the cross-entropy loss.