Lesson 14 of 15

Adam Optimizer

Adaptive Moment Estimation

Plain gradient descent uses the same learning rate for all parameters. Adam adapts the learning rate per-parameter by tracking two running statistics:

  • m_t — the first moment (exponential moving average of gradients)
  • v_t — the second moment (exponential moving average of squared gradients)

Update Rules

At step t, given gradient g_t:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

Bias correction (m and v start at 0, so early estimates are biased toward zero; dividing by 1 − βᵗ corrects this):

\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
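A quick numeric check of the bias correction (an illustrative snippet, not part of the lesson's starter code): with a constant gradient g, the raw moving average m_t stays biased toward zero for small t, while the corrected estimate recovers g at every step.

```python
beta1 = 0.9
g = 1.0  # constant gradient
m = 0.0  # first moment starts at zero

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g  # raw EMA, biased toward 0 early on
    m_hat = m / (1 - beta1 ** t)     # bias-corrected estimate
    print(f"t={t}: m={m:.3f}  m_hat={m_hat:.3f}")
# m climbs slowly (0.100, 0.190, 0.271) while m_hat is 1.000 at every step
```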

Parameter update:

\theta_t = \theta_{t-1} - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}
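Putting the three pieces together, a single scalar Adam step can be sketched like this (a hedged sketch; the function name `adam_step` and keyword defaults are my own choices, not the lesson's required signature):

```python
def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * g      # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g  # second-moment EMA
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, g=0.5, m=m, v=v, t=1)
# the first step moves theta by almost exactly lr,
# since after bias correction m_hat / sqrt(v_hat) ≈ sign(g)
```

Note the last comment: at t = 1 the bias-corrected ratio is g / |g|, so Adam's very first step has magnitude close to η regardless of the gradient's scale.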

Default Hyperparameters

The paper suggests: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸, η = 0.001.

  • β₁ = 0.9: gradient momentum decays with roughly a 10-step memory
  • β₂ = 0.999: the squared-gradient average has roughly a 1000-step memory
  • ε: prevents division by zero when the second moment is near zero
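The "10-step" and "1000-step" memories come from the usual rule of thumb that an exponential moving average with decay factor β averages over roughly 1/(1 − β) recent values (a quick check, not part of the lesson's starter code):

```python
beta1, beta2 = 0.9, 0.999
# effective averaging window of an EMA ≈ 1 / (1 - beta)
print(f"first-moment window  ≈ {1 / (1 - beta1):.0f} steps")   # ≈ 10
print(f"second-moment window ≈ {1 / (1 - beta2):.0f} steps")   # ≈ 1000
```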

Adam is the default optimizer for most deep learning — it works well across a wide range of architectures and learning rates.

Your Task

Implement adam_update(param, grad, m, v, t, lr, beta1, beta2, eps) returning (new_param, new_m, new_v).
