Lesson 15 of 15


Regularization & Evaluation Metrics

Overfitting occurs when a model memorises the training data and fails to generalise. Regularization adds a penalty term to the loss function that discourages large weights.

L1 Regularization (Lasso)

$$\mathcal{L}_{\text{L1}} = \lambda \sum_{i} |w_i|$$

L1 drives many weights to exactly zero, producing sparse models.
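As a minimal sketch (using NumPy, and the `l1_penalty` signature named in the task section), the penalty is just the scaled sum of absolute values:

```python
import numpy as np

def l1_penalty(w, lambda_):
    """L1 (lasso) penalty: lambda times the sum of absolute weights."""
    return lambda_ * np.sum(np.abs(w))

# Zeroed weights contribute nothing, so sparse models are cheap under L1:
w = np.array([3.0, -4.0, 0.0])
print(l1_penalty(w, 0.1))  # 0.1 * (3 + 4 + 0) = 0.7, up to float rounding
```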

L2 Regularization (Ridge)

$$\mathcal{L}_{\text{L2}} = \lambda \sum_{i} w_i^2$$

L2 shrinks all weights smoothly towards zero but rarely makes them exactly zero.
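A corresponding NumPy sketch (again matching the task's `l2_penalty` signature) squares the weights instead of taking absolute values:

```python
import numpy as np

def l2_penalty(w, lambda_):
    """L2 (ridge) penalty: lambda times the sum of squared weights."""
    return lambda_ * np.sum(w ** 2)

# Squaring punishes large weights much harder than small ones:
w = np.array([3.0, -4.0, 0.0])
print(l2_penalty(w, 0.1))  # 0.1 * (9 + 16 + 0) = 2.5
```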

Elastic Net

A combination of L1 and L2:

$$\mathcal{L}_{\text{EN}} = \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2$$
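Since elastic net is literally the sum of the two penalties above, a sketch (using the task's `elastic_net` signature) can compute both terms directly:

```python
import numpy as np

def elastic_net(w, lambda1, lambda2):
    """Elastic net penalty: L1 term plus L2 term, each with its own strength."""
    return lambda1 * np.sum(np.abs(w)) + lambda2 * np.sum(w ** 2)

# With equal strengths, this is just l1_penalty + l2_penalty:
w = np.array([3.0, -4.0, 0.0])
print(elastic_net(w, 0.1, 0.1))  # 0.7 + 2.5 = 3.2, up to float rounding
```

Setting `lambda2 = 0` recovers pure lasso; `lambda1 = 0` recovers pure ridge.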

Ridge Gradient

When an L2 penalty is added to the MSE loss, the gradient with respect to each weight $w_i$ gains an extra term:

$$\nabla_{w_i} \left(\text{MSE} + \lambda \|\mathbf{w}\|_2^2\right) = \nabla_{w_i} \text{MSE} + 2\lambda w_i$$

This is why ridge regression is equivalent to weight decay: each weight is slightly shrunk at every update.
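To make the weight-decay view concrete, here is a sketch of one gradient-descent step using the task's `ridge_gradient` signature (the data-loss gradient `grad_w` is assumed to be given; the values below are illustrative):

```python
import numpy as np

def ridge_gradient(w, grad_w, lambda_):
    """Gradient of (loss + lambda * ||w||^2): add 2*lambda*w element-wise."""
    return np.asarray(grad_w) + 2 * lambda_ * np.asarray(w)

# One gradient-descent step. With a zero data-loss gradient, the penalty
# alone multiplies every weight by (1 - 2 * lr * lambda) -- i.e. weight decay.
w = np.array([1.0, -2.0])
grad_w = np.array([0.0, 0.0])  # pretend the data loss is already minimized
lr = 0.1
w_new = w - lr * ridge_gradient(w, grad_w, lambda_=0.5)
print(w_new)  # each weight shrunk by the factor 1 - 2*0.1*0.5 = 0.9
```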

Evaluation Metrics for Classification

Accuracy alone is misleading on imbalanced datasets (e.g., 99% negative class → a model that always predicts "negative" gets 99% accuracy). Instead, use the confusion matrix counts:

                     Predicted Positive    Predicted Negative
  Actually Positive  TP (True Positive)    FN (False Negative)
  Actually Negative  FP (False Positive)   TN (True Negative)

From these we derive:

  • Precision: Of all predicted positives, how many are correct? $P = \frac{\text{TP}}{\text{TP} + \text{FP}}$
  • Recall (sensitivity): Of all actual positives, how many did we find? $R = \frac{\text{TP}}{\text{TP} + \text{FN}}$
  • F1 Score: The harmonic mean of precision and recall: $F_1 = \frac{2PR}{P + R}$

F1 balances the trade-off between high precision with low recall (a conservative model) and high recall with low precision (an aggressive model). By convention, return 0.0 when both precision and recall are zero, to avoid division by zero.
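A plain-Python sketch of these three metrics (using the task's `precision_recall_f1` signature, with labels assumed to be 0/1):

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard each denominator; by convention the metric is 0.0 when undefined.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One true positive, one false positive, one false negative:
print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.5, 0.5, 0.5)
```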

Your Task

Implement:

  • l1_penalty(w, lambda_) → $\lambda \sum_i |w_i|$
  • l2_penalty(w, lambda_) → $\lambda \sum_i w_i^2$
  • elastic_net(w, lambda1, lambda2) → L1 + L2
  • ridge_gradient(w, grad_w, lambda_) → element-wise $\nabla_{w_i}\text{MSE} + 2\lambda w_i$
  • precision_recall_f1(y_true, y_pred) → tuple of (precision, recall, f1)