Lesson 13 of 15

L2 Regularization

Preventing Overfitting

A network trained long enough on small data will memorise the training set — achieving near-zero training loss while generalising poorly. This is overfitting.

Regularization adds a penalty for large weights to the loss function, pushing the model toward simpler solutions.

L2 Regularization (Weight Decay)

Add the squared norm of all weights to the loss:

\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{data}} + \lambda \sum_{i} w_i^2

λ (lambda) controls the trade-off between fitting the data and keeping weights small.
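The penalty term is easy to compute directly. A minimal sketch, using illustrative weight and loss values (not from the lesson):

```python
import numpy as np

# Illustrative values: a small weight vector and a hypothetical data loss.
weights = np.array([0.5, -1.2, 3.0])
data_loss = 0.8
lambda_ = 0.01

# Regularized loss = data loss + lambda * sum of squared weights.
l2_penalty = lambda_ * np.sum(weights ** 2)
total_loss = data_loss + l2_penalty
print(total_loss)  # 0.8 + 0.01 * (0.25 + 1.44 + 9.0) = 0.9069
```

Note that a larger λ makes the penalty dominate, driving the optimum toward smaller weights at the cost of a higher data loss.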

Effect on Gradients

The regularization term adds an extra gradient contribution for each weight:

\frac{\partial \mathcal{L}_{\text{reg}}}{\partial w_i} = \frac{\partial \mathcal{L}_{\text{data}}}{\partial w_i} + 2\lambda w_i

This pulls every weight toward zero on every update step. For gradient descent with learning rate η, it is equivalent to multiplying each weight by (1 − 2λη) before applying the data-loss gradient — hence the name weight decay.
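The equivalence of the two forms can be checked numerically. A quick sketch with illustrative weights and a hypothetical data-loss gradient:

```python
import numpy as np

w = np.array([0.5, -1.2, 3.0])
grad_data = np.array([0.1, -0.3, 0.2])  # hypothetical data-loss gradient
lambda_, eta = 0.01, 0.1

# Form 1: one step with the combined gradient (data gradient + 2*lambda*w).
w1 = w - eta * (grad_data + 2 * lambda_ * w)

# Form 2: decay the weights by (1 - 2*lambda*eta), then take the data-gradient step.
w2 = w * (1 - 2 * lambda_ * eta) - eta * grad_data

print(np.allclose(w1, w2))  # True
```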

Intuition

L2 regularization prefers many small weights over a few large ones. Geometrically, it constrains the weights to lie near the origin.

Your Task

Implement:

  • l2_loss(predictions, targets, weights_flat, lambda_) — MSE plus L2 penalty
  • l2_grad(weight, lambda_) — gradient of \lambda w^2 with respect to w
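One possible sketch of the two functions, assuming "MSE" means the mean of squared errors and that weights_flat is a flat array of all weights (the exact conventions are set by the exercise's tests):

```python
import numpy as np

def l2_loss(predictions, targets, weights_flat, lambda_):
    """MSE over predictions/targets plus lambda * sum of squared weights."""
    mse = np.mean((np.asarray(predictions) - np.asarray(targets)) ** 2)
    penalty = lambda_ * np.sum(np.asarray(weights_flat) ** 2)
    return mse + penalty

def l2_grad(weight, lambda_):
    """Gradient of lambda * w^2 with respect to w."""
    return 2 * lambda_ * weight
```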