Lesson 11 of 17

RMSNorm

Normalization

Deep networks suffer from exploding or vanishing activations — values that grow too large or shrink to zero as they pass through many layers. Normalization keeps activations in a stable range.

RMSNorm

RMSNorm normalizes each vector by its root mean square:

rms(x) = sqrt(mean(x²) + ε)
x_norm = x / rms(x)

The small ε = 1e-5 prevents division by zero.

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)   # mean of squares
    scale = (ms + 1e-5) ** -0.5              # 1 / rms
    return [xi * scale for xi in x]          # x / rms(x)
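The same function body runs unchanged on plain floats, which makes a quick sanity check easy: the output vector should come out with RMS ≈ 1.

```python
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)   # mean of squares
    scale = (ms + 1e-5) ** -0.5              # 1 / rms
    return [xi * scale for xi in x]

x = [3.0, -4.0]                  # mean of squares = 12.5, rms ≈ 3.5355
y = rmsnorm(x)
rms_y = (sum(v * v for v in y) / len(y)) ** 0.5
print([round(v, 4) for v in y])  # ≈ [0.8485, -1.1314]
print(round(rms_y, 4))           # ≈ 1.0 (off by a hair because of ε)
```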

Why RMSNorm over LayerNorm?

LayerNorm subtracts the mean and divides by the standard deviation; RMSNorm skips the mean subtraction and divides only by the RMS — one fewer statistic to compute, so it is simpler and slightly faster. Modern architectures (LLaMA, Mistral, Gemma) use RMSNorm.
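A side-by-side sketch on plain floats makes the difference concrete (LayerNorm is shown here without the learnable gain and bias that real layers also carry):

```python
def layernorm(x, eps=1e-5):
    mu = sum(x) / len(x)                             # mean
    var = sum((xi - mu) ** 2 for xi in x) / len(x)   # variance
    return [(xi - mu) / (var + eps) ** 0.5 for xi in x]

def rmsnorm(x, eps=1e-5):
    ms = sum(xi * xi for xi in x) / len(x)           # mean of squares
    return [xi / (ms + eps) ** 0.5 for xi in x]      # no mean subtraction

x = [1.0, 2.0, 3.0]
print(layernorm(x))  # zero-mean output, unit variance
print(rmsnorm(x))    # same direction as x, rms ≈ 1
```

LayerNorm recenters the vector; RMSNorm only rescales it, so the output stays proportional to the input.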

In MicroGPT, RMSNorm is applied before the attention and MLP blocks (pre-norm), which stabilizes training more reliably than post-norm.
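Schematically, with `attn` and `mlp` standing in for the sublayers (placeholder names for illustration, not MicroGPT's actual API), the two arrangements look like this:

```python
def rmsnorm(x, eps=1e-5):
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi / (ms + eps) ** 0.5 for xi in x]

def pre_norm_block(x, attn, mlp):
    # Pre-norm: each sublayer sees a normalized input, while the
    # residual (skip) path carries x through unscaled.
    x = [a + b for a, b in zip(x, attn(rmsnorm(x)))]  # x + attn(norm(x))
    x = [a + b for a, b in zip(x, mlp(rmsnorm(x)))]   # x + mlp(norm(x))
    return x

def post_norm_block(x, attn, mlp):
    # Post-norm (original Transformer): normalization sits on the
    # residual path itself, rescaling the stream after every sublayer.
    x = rmsnorm([a + b for a, b in zip(x, attn(x))])
    x = rmsnorm([a + b for a, b in zip(x, mlp(x))])
    return x

identity = lambda v: v  # trivial sublayers, just to show the wiring
print(pre_norm_block([1.0, 2.0, 3.0], identity, identity))
```

Because pre-norm never rescales the skip connection, gradients reach early layers undistorted, which is why it trains more reliably.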

Working with Value Objects

All operations here are defined on Value:

  • xi * xi uses Value.__mul__
  • sum(...) uses Value.__add__, and / len(x) uses Value.__truediv__
  • (ms + 1e-5) ** -0.5 uses Value.__add__ and Value.__pow__
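MicroGPT's actual Value class also tracks gradients for backpropagation; this stripped-down stand-in (data only, no autograd) just traces which operator methods rmsnorm exercises:

```python
class Value:
    """Stripped-down stand-in for MicroGPT's Value: data only,
    no gradient tracking."""
    def __init__(self, data):
        self.data = data
    def __add__(self, other):      # ms + 1e-5
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data)
    def __radd__(self, other):     # lets sum() start from the int 0
        return self + other
    def __mul__(self, other):      # xi * xi, xi * scale
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data)
    def __truediv__(self, other):  # sum(...) / len(x)
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data / other.data)
    def __pow__(self, other):      # (...) ** -0.5, plain-number exponent
        return Value(self.data ** other)

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)   # __mul__, __radd__, __truediv__
    scale = (ms + 1e-5) ** -0.5              # __add__, __pow__
    return [xi * scale for xi in x]          # __mul__

out = rmsnorm([Value(3.0), Value(-4.0)])
print([round(v.data, 4) for v in out])       # ≈ [0.8485, -1.1314]
```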

Your Task

Implement rmsnorm(x) where x is a list of Value objects.
