The Training Loop
Putting It Together
The training loop is the core of all neural network learning:
for epoch in range(num_epochs):
    for (x, y) in training_data:
        1. Forward pass   → compute prediction a
        2. Compute loss   → L = (a - y)²
        3. Backward pass  → compute gradients dw, db
        4. Update params  → w -= lr * dw; b -= lr * db
This is stochastic gradient descent (SGD) — we update after every single example.
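The steps above can be turned into a minimal runnable sketch for a single sigmoid neuron on 1D data (the `sigmoid` and `sgd_epoch` names are illustrative, not from a library):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_epoch(w, b, data, lr):
    """One SGD epoch: update the parameters after every single sample."""
    for x, y in data:
        a = sigmoid(w * x + b)          # 1. forward pass: prediction a
        # 2. loss is L = (a - y)**2; we only need its gradient below
        dz = 2 * (a - y) * a * (1 - a)  # 3. backward pass via the chain rule
        dw, db = dz * x, dz
        w -= lr * dw                    # 4. update parameters
        b -= lr * db
    return w, b
```

Running `sgd_epoch` repeatedly on a toy dataset drives the loss down, which is exactly the convergence behaviour discussed below.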
Batch Processing
In practice, we rarely update after every sample. Instead, we group samples into mini-batches:
- Batch size 1 (SGD): Noisy gradients, fast updates, good for escaping local minima
- Full batch: Stable gradients, but slow and memory-intensive
- Mini-batch (typically 32–256): Best of both worlds — stable enough for convergence, small enough for speed
With mini-batches, we average the gradients across the batch before updating weights.
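A sketch of one mini-batch update for the same sigmoid neuron, assuming the gradient formulas from the SGD loop (`minibatch_step` is an illustrative name):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def minibatch_step(w, b, batch, lr):
    """Accumulate per-sample gradients, average them, then update once."""
    dw_sum = db_sum = 0.0
    for x, y in batch:
        a = sigmoid(w * x + b)
        dz = 2 * (a - y) * a * (1 - a)
        dw_sum += dz * x
        db_sum += dz
    n = len(batch)
    w -= lr * (dw_sum / n)   # one update with the averaged gradient
    b -= lr * (db_sum / n)
    return w, b
```

Averaging (rather than summing) keeps the effective step size independent of the batch size, so the same learning rate works across different batch sizes.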
Convergence
After enough iterations, the weights converge to values that minimize the loss on the training data. You can track this by printing the loss every few epochs.
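A sketch of this kind of loss logging, using the single-neuron setup from this section (the function names and logging interval are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(w, b, data):
    """Mean squared loss over the dataset — a simple convergence signal."""
    return sum((sigmoid(w * x + b) - y) ** 2 for x, y in data) / len(data)

def train_with_logging(data, lr=0.5, epochs=100, log_every=10):
    w = b = 0.0
    history = []
    for epoch in range(epochs):
        for x, y in data:                    # SGD inner loop
            a = sigmoid(w * x + b)
            dz = 2 * (a - y) * a * (1 - a)
            w -= lr * dz * x
            b -= lr * dz
        if epoch % log_every == 0:           # print the loss every few epochs
            loss = mean_loss(w, b, data)
            history.append(loss)
            print(f"epoch {epoch:3d}  loss {loss:.4f}")
    return w, b, history
```

A steadily decreasing logged loss indicates the weights are converging; a flat or rising curve suggests the learning rate needs adjusting.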
Overfitting Prevention
Two key techniques prevent overfitting during training:
Dropout: During training, randomly zero out a fraction of neuron activations each forward pass. This forces the network to learn redundant representations. At test time, all neurons are active but outputs are scaled by the keep probability (1 − dropout rate). Typical dropout rates: 0.1–0.5.
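A minimal sketch of this classic dropout scheme on a list of activations (`dropout_forward` is an illustrative name):

```python
import random

def dropout_forward(activations, rate, training):
    """Classic dropout: zero a fraction `rate` of units during training;
    at test time keep all units but scale by the keep probability (1 - rate)."""
    if training:
        return [0.0 if random.random() < rate else a for a in activations]
    return [a * (1.0 - rate) for a in activations]
```

Note that most modern libraries instead use "inverted" dropout, which scales the surviving activations by 1/(1 − rate) during training so that no scaling is needed at test time; the two schemes are equivalent in expectation.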
Early stopping: Monitor validation loss during training. When it stops improving for a set number of epochs (the "patience"), stop training and use the weights from the best epoch. This prevents the network from memorising the training data.
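The patience logic can be sketched as a function over a list of per-epoch validation losses (`early_stopping` is a hypothetical helper, not a library call):

```python
def early_stopping(val_losses, patience):
    """Return the index of the best epoch, stopping once validation loss
    has not improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break   # no improvement for `patience` epochs: stop early
    return best_epoch
```

In a real loop you would also checkpoint the weights at each new best epoch, so you can restore them after stopping.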
One Neuron, One Dimension
We will train the simplest possible network: a single sigmoid neuron on 1D data, a = σ(wx + b), with loss L = (a - y)².
The backward pass: dz = 2(a - y) · a(1 - a), then dw = dz · x, db = dz.
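The backward pass for this neuron can be sanity-checked against numerical differentiation, a standard way to verify gradient code (the function names here are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    return (sigmoid(w * x + b) - y) ** 2

def grads(w, b, x, y):
    """Analytic gradients: dz = 2(a - y) * a(1 - a); dw = dz * x; db = dz."""
    a = sigmoid(w * x + b)
    dz = 2 * (a - y) * a * (1 - a)
    return dz * x, dz

def numeric_grads(w, b, x, y, eps=1e-6):
    """Central finite differences — should match the chain-rule gradients."""
    dw = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
    db = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
    return dw, db
```

If the analytic and numeric gradients disagree beyond roundoff error, there is a bug in the backward pass.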
Your Task
Implement:
- train(X, y, lr, epochs) — SGD training (one sample at a time)
- train_batch(X, y, lr, epochs, batch_size) — mini-batch training (average gradients over each batch)
Both initialise w and b to 0.0 and return (w, b) after training.