Lesson 13 of 15

L2 Regularization

Preventing Overfitting

A network trained long enough on small data will memorise the training set — achieving near-zero training loss while generalising poorly. This is overfitting.

Regularization adds a penalty for large weights to the loss function, pushing the model toward simpler solutions.

L2 Regularization (Weight Decay)

Add the squared norm of all weights to the loss:

\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{data}} + \lambda \sum_{i} w_i^2

λ (lambda) controls the trade-off between fitting the data and keeping weights small.
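The penalty term is easy to compute directly. A minimal sketch, using illustrative weight and loss values (not from the lesson):

```python
import numpy as np

# Illustrative values: a small weight vector and a hypothetical data loss.
weights = np.array([0.5, -1.2, 3.0])
data_loss = 0.8
lambda_ = 0.01

# Regularized loss = data loss + lambda * sum of squared weights.
l2_penalty = lambda_ * np.sum(weights ** 2)
total_loss = data_loss + l2_penalty
print(total_loss)  # 0.8 + 0.01 * (0.25 + 1.44 + 9.0) = 0.9069
```

Note that a larger λ makes the penalty dominate, driving the optimum toward smaller weights at the cost of a higher data loss.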

Effect on Gradients

The regularization term adds an extra gradient contribution for each weight:

\frac{\partial \mathcal{L}_{\text{reg}}}{\partial w_i} = \frac{\partial \mathcal{L}_{\text{data}}}{\partial w_i} + 2\lambda w_i

This pulls every weight toward zero on every update step. For gradient descent with learning rate η, it is equivalent to multiplying each weight by (1 − 2λη) before applying the data-loss gradient — hence the name weight decay.
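The equivalence of the two forms can be checked numerically. A quick sketch with illustrative weights and a hypothetical data-loss gradient:

```python
import numpy as np

w = np.array([0.5, -1.2, 3.0])
grad_data = np.array([0.1, -0.3, 0.2])  # hypothetical data-loss gradient
lambda_, eta = 0.01, 0.1

# Form 1: one step with the combined gradient (data gradient + 2*lambda*w).
w1 = w - eta * (grad_data + 2 * lambda_ * w)

# Form 2: decay the weights by (1 - 2*lambda*eta), then take the data-gradient step.
w2 = w * (1 - 2 * lambda_ * eta) - eta * grad_data

print(np.allclose(w1, w2))  # True
```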

Intuition

L2 regularization prefers many small weights over a few large ones. Geometrically, it constrains the weights to lie near the origin.

Your Task

Implement:

  • l2_loss(predictions, targets, weights_flat, lambda_) — MSE plus L2 penalty
  • l2_grad(weight, lambda_) — gradient of \lambda w^2 with respect to w
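One possible sketch of the two functions, assuming "MSE" means the mean of squared errors and that weights_flat is a flat array of all weights (the exact conventions are set by the exercise's tests):

```python
import numpy as np

def l2_loss(predictions, targets, weights_flat, lambda_):
    """MSE over predictions/targets plus lambda * sum of squared weights."""
    mse = np.mean((np.asarray(predictions) - np.asarray(targets)) ** 2)
    penalty = lambda_ * np.sum(np.asarray(weights_flat) ** 2)
    return mse + penalty

def l2_grad(weight, lambda_):
    """Gradient of lambda * w^2 with respect to w."""
    return 2 * lambda_ * weight
```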