L2 Regularization
Preventing Overfitting
A network trained long enough on small data will memorise the training set — achieving near-zero training loss while generalising poorly. This is overfitting.
Regularization adds a penalty for large weights to the loss function, pushing the model toward simpler solutions.
L2 Regularization (Weight Decay)
Add the squared norm of all weights to the loss:

L_total = L_data + (λ / 2) · Σᵢ wᵢ²

λ (lambda) controls the trade-off between fitting the data and keeping the weights small.
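A minimal sketch of the penalty computation in NumPy; the variable names and the example data-loss value are illustrative, not part of the task:

```python
import numpy as np

def l2_penalty(weights_flat, lambda_):
    """L2 penalty: (lambda / 2) * sum of squared weights."""
    return 0.5 * lambda_ * np.sum(weights_flat ** 2)

weights = np.array([1.0, -2.0, 0.5])
data_loss = 0.3  # hypothetical MSE value for illustration
total_loss = data_loss + l2_penalty(weights, lambda_=0.1)
# penalty = 0.05 * (1.0 + 4.0 + 0.25) = 0.2625
```

Larger λ makes the penalty dominate, shrinking weights more aggressively; λ = 0 recovers the unregularized loss.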
Effect on Gradients
The regularization term adds an extra gradient contribution for each weight:

∂/∂wᵢ [ (λ / 2) · Σⱼ wⱼ² ] = λ wᵢ

This pulls every weight toward zero on every update step — equivalent to multiplying the weights by (1 − ηλ), where η is the learning rate, before the plain gradient update, hence the name weight decay.
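The equivalence can be checked numerically. A sketch with a hypothetical data gradient (all values here are made up for illustration):

```python
import numpy as np

eta, lambda_ = 0.1, 0.01          # learning rate and regularization strength
w = np.array([1.0, -0.5, 2.0])
data_grad = np.array([0.2, -0.1, 0.4])  # hypothetical gradient of the data loss

# Explicit L2: add lambda * w to the gradient, then step
w_l2 = w - eta * (data_grad + lambda_ * w)

# Weight decay: shrink the weights first, then take the plain gradient step
w_decay = (1 - eta * lambda_) * w - eta * data_grad

# Both formulations produce the same update
assert np.allclose(w_l2, w_decay)
```

Expanding the first expression gives w − η·data_grad − ηλ·w, which regroups into the second; the two are algebraically identical for plain SGD.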
Intuition
L2 regularization prefers many small weights over a few large ones. Geometrically, it constrains the weights to lie near the origin.
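The preference for spread-out weights follows from squaring: for the same total magnitude, concentrating it in one weight costs more. A small illustrative check:

```python
import numpy as np

concentrated = np.array([1.0, 0.0])  # one large weight
spread = np.array([0.5, 0.5])        # same total magnitude, distributed

def sq_norm(w):
    """Squared L2 norm, the quantity the penalty charges for."""
    return float(np.sum(w ** 2))

# sq_norm(concentrated) = 1.0, sq_norm(spread) = 0.5:
# the penalty is lower when the magnitude is shared across weights.
```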
Your Task
Implement:

- l2_loss(predictions, targets, weights_flat, lambda_) — MSE plus the L2 penalty
- l2_grad(weight, lambda_) — gradient of the L2 penalty with respect to a single weight
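One possible shape of the solution, assuming the data term is a plain mean squared error and the penalty uses the λ/2 convention from above (the exact conventions your grader expects may differ):

```python
import numpy as np

def l2_loss(predictions, targets, weights_flat, lambda_):
    """MSE data loss plus the L2 penalty (lambda / 2) * sum(w^2)."""
    mse = np.mean((predictions - targets) ** 2)
    return mse + 0.5 * lambda_ * np.sum(weights_flat ** 2)

def l2_grad(weight, lambda_):
    """Gradient of (lambda / 2) * w^2 with respect to w, i.e. lambda * w."""
    return lambda_ * weight
```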