The Training Loop
Putting It Together
The training loop is the core of all neural network learning:
for epoch in range(num_epochs):
    for (x, y) in training_data:
        1. Forward pass   → compute prediction a
        2. Compute loss   → L = (a - y)²
        3. Backward pass  → compute gradients dw, db
        4. Update params  → w -= lr * dw; b -= lr * db
This is stochastic gradient descent (SGD) — we update after every single example.
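The steps above can be turned into a minimal runnable sketch for a single sigmoid neuron on 1D data (the `sigmoid` and `sgd_epoch` names are illustrative, not from a library):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_epoch(w, b, data, lr):
    """One SGD epoch: update the parameters after every single sample."""
    for x, y in data:
        a = sigmoid(w * x + b)          # 1. forward pass: prediction a
        # 2. loss is L = (a - y)**2; we only need its gradient below
        dz = 2 * (a - y) * a * (1 - a)  # 3. backward pass via the chain rule
        dw, db = dz * x, dz
        w -= lr * dw                    # 4. update parameters
        b -= lr * db
    return w, b
```

Running `sgd_epoch` repeatedly on a toy dataset drives the loss down, which is exactly the convergence behaviour discussed below.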
Batch Processing
In practice, we rarely update after every sample. Instead, we group samples into mini-batches:
- Batch size 1 (SGD): Noisy gradients, fast updates, good for escaping local minima
- Full batch: Stable gradients, but slow and memory-intensive
- Mini-batch (typically 32–256): Best of both worlds — stable enough for convergence, small enough for speed
With mini-batches, we average the gradients across the batch before updating weights.
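A sketch of one mini-batch update for the same sigmoid neuron, assuming the gradient formulas from the SGD loop (`minibatch_step` is an illustrative name):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def minibatch_step(w, b, batch, lr):
    """Accumulate per-sample gradients, average them, then update once."""
    dw_sum = db_sum = 0.0
    for x, y in batch:
        a = sigmoid(w * x + b)
        dz = 2 * (a - y) * a * (1 - a)
        dw_sum += dz * x
        db_sum += dz
    n = len(batch)
    w -= lr * (dw_sum / n)   # one update with the averaged gradient
    b -= lr * (db_sum / n)
    return w, b
```

Averaging (rather than summing) keeps the effective step size independent of the batch size, so the same learning rate works across different batch sizes.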
Convergence
After enough iterations, the weights converge to values that minimize the loss on the training data. You can track this by printing the loss every few epochs.
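A sketch of this kind of loss logging, using the single-neuron setup from this section (the function names and logging interval are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(w, b, data):
    """Mean squared loss over the dataset — a simple convergence signal."""
    return sum((sigmoid(w * x + b) - y) ** 2 for x, y in data) / len(data)

def train_with_logging(data, lr=0.5, epochs=100, log_every=10):
    w = b = 0.0
    history = []
    for epoch in range(epochs):
        for x, y in data:                    # SGD inner loop
            a = sigmoid(w * x + b)
            dz = 2 * (a - y) * a * (1 - a)
            w -= lr * dz * x
            b -= lr * dz
        if epoch % log_every == 0:           # print the loss every few epochs
            loss = mean_loss(w, b, data)
            history.append(loss)
            print(f"epoch {epoch:3d}  loss {loss:.4f}")
    return w, b, history
```

A steadily decreasing logged loss indicates the weights are converging; a flat or rising curve suggests the learning rate needs adjusting.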
Overfitting Prevention
Two key techniques prevent overfitting during training:
Dropout: During training, randomly zero out a fraction of neuron activations each forward pass. This forces the network to learn redundant representations. At test time, all neurons are active but outputs are scaled by the keep probability (1 − dropout rate). Typical dropout rates: 0.1–0.5.
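A minimal sketch of this classic dropout scheme on a list of activations (`dropout_forward` is an illustrative name):

```python
import random

def dropout_forward(activations, rate, training):
    """Classic dropout: zero a fraction `rate` of units during training;
    at test time keep all units but scale by the keep probability (1 - rate)."""
    if training:
        return [0.0 if random.random() < rate else a for a in activations]
    return [a * (1.0 - rate) for a in activations]
```

Note that most modern libraries instead use "inverted" dropout, which scales the surviving activations by 1/(1 − rate) during training so that no scaling is needed at test time; the two schemes are equivalent in expectation.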
Early stopping: Monitor validation loss during training. When it stops improving for a set number of epochs (the "patience"), stop training and use the weights from the best epoch. This prevents the network from memorising the training data.
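The patience logic can be sketched as a function over a list of per-epoch validation losses (`early_stopping` is a hypothetical helper, not a library call):

```python
def early_stopping(val_losses, patience):
    """Return the index of the best epoch, stopping once validation loss
    has not improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break   # no improvement for `patience` epochs: stop early
    return best_epoch
```

In a real loop you would also checkpoint the weights at each new best epoch, so you can restore them after stopping.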
One Neuron, One Dimension
We will train the simplest possible network: a single sigmoid neuron on 1D data, a = σ(wx + b), with loss L = (a - y)².
The backward pass: dz = 2(a - y) · a(1 - a), then dw = dz · x, db = dz.
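The backward pass for this neuron can be sanity-checked against numerical differentiation, a standard way to verify gradient code (the function names here are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    return (sigmoid(w * x + b) - y) ** 2

def grads(w, b, x, y):
    """Analytic gradients: dz = 2(a - y) * a(1 - a); dw = dz * x; db = dz."""
    a = sigmoid(w * x + b)
    dz = 2 * (a - y) * a * (1 - a)
    return dz * x, dz

def numeric_grads(w, b, x, y, eps=1e-6):
    """Central finite differences — should match the chain-rule gradients."""
    dw = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
    db = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
    return dw, db
```

If the analytic and numeric gradients disagree beyond roundoff error, there is a bug in the backward pass.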
Your Task
Implement:
- train(X, y, lr, epochs) — SGD training (one sample at a time)
- train_batch(X, y, lr, epochs, batch_size) — mini-batch training (average gradients over each batch)
Both initialise w and b to 0.0 and return (w, b) after training.