Cross-Entropy Loss
The Loss Function
The model produces a probability for every token in the vocabulary. Cross-entropy loss measures how surprised the model is by the correct next token.
Definition
loss = -log(p_correct)
If the model assigns probability 1.0 to the correct token: loss = -log(1) = 0.
If it assigns probability 0.01: loss = -log(0.01) ≈ 4.6.
The lower the probability the model assigned to the correct token, the larger the penalty.
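The definition above can be checked directly with plain floats (a minimal sketch using Python's standard `math` module, independent of the `Value` class):

```python
import math

def cross_entropy(p_correct):
    # Negative log of the probability assigned to the correct token.
    return -math.log(p_correct)

print(cross_entropy(1.0))   # 0.0: full confidence in the right answer, no penalty
print(cross_entropy(0.01))  # ~4.605: low probability, large penalty
```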
Random Baseline
With a vocabulary of 27 (26 letters + BOS), a random model assigns 1/27 to each token:
baseline loss = -log(1/27) ≈ 3.30
MicroGPT starts around 3.5 (slightly worse than random due to random initialization) and should reach around 2.0 after 1000 training steps.
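The baseline figure is easy to reproduce (a quick check, using only the vocabulary size stated above):

```python
import math

vocab_size = 27  # 26 letters + BOS
baseline = -math.log(1 / vocab_size)  # equivalently, math.log(27)
print(round(baseline, 2))  # 3.3
```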
Implementation
probs = softmax(logits)
loss = -probs[target_id].log()
The Value.log() op records the local gradient d(log(p)) / dp = 1/p; combined with the negation, the overall derivative of the loss with respect to p is d(-log(p)) / dp = -1/p.
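That derivative can be sanity-checked numerically with a central finite difference (a standalone sketch, not using the `Value` class):

```python
import math

p = 0.25
eps = 1e-6

# Central finite difference approximation of d(-log(p)) / dp at p = 0.25.
numeric = (-math.log(p + eps) - (-math.log(p - eps))) / (2 * eps)
analytic = -1 / p

# The two should agree to several decimal places (both close to -4.0).
print(numeric, analytic)
```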
This is one loss per position. We average over all positions in a document:
loss = (1 / n) * sum(losses)
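Averaging per-position losses looks like this with plain floats (the probabilities below are hypothetical, chosen only to illustrate the formula):

```python
import math

# Hypothetical probabilities the model assigned to the correct token
# at each of 4 positions in a document.
probs = [0.5, 0.1, 0.9, 0.25]

# One cross-entropy loss per position, then the mean over all positions.
losses = [-math.log(p) for p in probs]
mean_loss = sum(losses) / len(losses)
print(round(mean_loss, 3))  # 1.122
```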
Your Task
Given logits as Value objects and a target index, compute the cross-entropy loss.