Lesson 11 of 17
RMSNorm
Normalization
Deep networks suffer from exploding or vanishing activations — values that grow too large or shrink to zero as they pass through many layers. Normalization keeps activations in a stable range.
RMSNorm
RMSNorm normalizes each vector by its root mean square:
rms(x) = sqrt(mean(x²) + ε)
x_norm = x / rms(x)
The small constant ε = 1e-5 prevents division by zero when x is all zeros.
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)  # mean of squares
    scale = (ms + 1e-5) ** -0.5             # 1 / rms
    return [xi * scale for xi in x]
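As a quick sanity check (shown here on plain floats; in the exercise x will be a list of Value objects), the normalized vector always comes out with an RMS of approximately 1, while the ratios between elements are preserved:

```python
import math

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)  # mean of squares
    scale = (ms + 1e-5) ** -0.5             # 1 / rms
    return [xi * scale for xi in x]

y = rmsnorm([3.0, -4.0, 12.0, 0.0])
rms_y = math.sqrt(sum(v * v for v in y) / len(y))
print(rms_y)  # ≈ 1.0
```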
Why RMSNorm over LayerNorm?
LayerNorm subtracts the mean and divides by the standard deviation. RMSNorm skips the centering step and only divides by the RMS, which is simpler and slightly cheaper to compute. Modern architectures (LLaMA, Mistral, Gemma) use RMSNorm.
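A minimal sketch of the contrast, on plain floats (LayerNorm shown without its learnable gain and bias):

```python
def layernorm(x, eps=1e-5):
    # LayerNorm: center by the mean, then scale by the standard deviation.
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / n
    return [(xi - mean) / (var + eps) ** 0.5 for xi in x]

# RMSNorm drops the centering entirely: no mean, no variance --
# just one pass for the mean of squares and a single scale factor.
out = layernorm([1.0, 2.0, 3.0])
print(out)  # output has zero mean
```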
In MicroGPT, RMSNorm is applied before the attention and MLP blocks (pre-norm), which stabilizes training more reliably than post-norm.
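Schematically, a pre-norm block normalizes the input to each sublayer and adds the sublayer's output back to the residual stream. A sketch with placeholder sublayers (attention and mlp here are identity stand-ins for illustration, not MicroGPT's actual implementations):

```python
def rmsnorm(x, eps=1e-5):
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi * (ms + eps) ** -0.5 for xi in x]

def block(x, attention, mlp):
    # Pre-norm: each sublayer sees a normalized input, while the raw
    # residual stream x is carried forward and updated additively.
    x = [a + b for a, b in zip(x, attention(rmsnorm(x)))]
    x = [a + b for a, b in zip(x, mlp(rmsnorm(x)))]
    return x

# Identity sublayers, just to show the wiring.
out = block([1.0, 2.0], attention=lambda v: v, mlp=lambda v: v)
print(out)
```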
Working with Value Objects
All operations here are defined on Value:
xi * xi uses Value.__mul__
sum(...) / len(x) uses Value.__truediv__
(ms + 1e-5) ** -0.5 uses Value.__add__ and Value.__pow__
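For intuition, here is a minimal, gradient-free stand-in for Value that supports exactly these operators. This sketch is an assumption about the interface, not MicroGPT's actual class, which also records the computation graph so backpropagation works:

```python
class Value:
    # Minimal stand-in: wraps a float and overloads arithmetic.
    # (The real Value also tracks children/ops for backpropagation.)
    def __init__(self, data):
        self.data = data
    def _wrap(self, other):
        return other if isinstance(other, Value) else Value(other)
    def __add__(self, other):
        return Value(self.data + self._wrap(other).data)
    def __radd__(self, other):       # lets sum() start from the int 0
        return self._wrap(other) + self
    def __mul__(self, other):
        return Value(self.data * self._wrap(other).data)
    def __truediv__(self, other):
        return Value(self.data / self._wrap(other).data)
    def __pow__(self, k):
        return Value(self.data ** k)

x = [Value(3.0), Value(-4.0)]
ms = sum(xi * xi for xi in x) / len(x)   # __mul__, __radd__, __truediv__
scale = (ms + 1e-5) ** -0.5              # __add__, __pow__
norm = [xi * scale for xi in x]
print([v.data for v in norm])
```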
Your Task
Implement rmsnorm(x) where x is a list of Value objects.