Lesson 17 of 17

Adam Optimizer

Adam: Adaptive Moment Estimation

Plain gradient descent uses the same learning rate for every parameter at every step. Adam improves on this in two ways:

  1. Momentum (first moment m): smooths out noisy gradients by keeping an exponential moving average. Instead of jumping with every new gradient, we drift in a consistent direction.

  2. Adaptive rates (second moment v): parameters with consistently large gradients get smaller steps; parameters with small gradients get larger steps.

The Update Rule

m = beta1 * m + (1 - beta1) * grad       # first moment (momentum)
v = beta2 * v + (1 - beta2) * grad**2    # second moment (uncentered variance)

m_hat = m / (1 - beta1**(step+1))        # bias correction (step is 0-indexed)
v_hat = v / (1 - beta2**(step+1))        # bias correction

param -= lr * m_hat / (v_hat**0.5 + eps)
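The lines above can be combined into one function. Here is a minimal sketch, assuming p, m, and v are flat lists of floats, the gradients are passed in explicitly, and step is 0-indexed (the argument names and list-based interface are illustrative, not necessarily the exact exercise signature):

```python
def adam_step(p, grads, m, v, step, lr=0.01, beta1=0.85, beta2=0.99, eps=1e-8):
    """One Adam update over flat lists of parameters and gradients."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g         # first moment
        v[i] = beta2 * v[i] + (1 - beta2) * g ** 2    # second moment
        m_hat = m[i] / (1 - beta1 ** (step + 1))      # bias correction
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p[i] -= lr * m_hat / (v_hat ** 0.5 + eps)
    return m, v

# Toy usage: one parameter, gradient 1.0, first step.
p, m, v = [0.0], [0.0], [0.0]
m, v = adam_step(p, [1.0], m, v, step=0)
```

Note what happens on the very first step: bias correction makes m_hat equal the raw gradient and v_hat its square, so the update is approximately lr regardless of the gradient's scale.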

Bias Correction

At the start of training, m and v are initialized to 0, so the moving averages are biased toward zero. Dividing by (1 - beta**t) corrects this. As t grows, the correction factor approaches 1 and the correction fades away.
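A quick numerical check of the correction (plain Python, values chosen for illustration): with a constant gradient, the raw v badly underestimates the true second moment early on, while the corrected v_hat recovers it exactly at every step.

```python
beta2 = 0.99
g, v = 2.0, 0.0        # constant gradient; true second moment is g**2 = 4.0
for t in range(1, 11):
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = v / (1 - beta2 ** t)    # bias-corrected estimate
print(v, v_hat)        # raw v is still far below 4.0; v_hat is 4.0
```

After 10 steps the raw average has only reached about 0.38, yet dividing by (1 - beta2**t) gives 4.0 at each step.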

MicroGPT Hyperparameters

lr = 0.01
beta1 = 0.85   # lower than typical 0.9 — less momentum
beta2 = 0.99   # lower than typical 0.999 — faster adaptation
eps = 1e-8

Your Task

Implement adam_step(p, m, v, step) that performs one Adam update and returns the updated (m, v).
