Lesson 14 of 15

Adam Optimizer

Adaptive Moment Estimation

Plain gradient descent uses the same learning rate for all parameters. Adam adapts the learning rate per-parameter by tracking two running statistics:

  • m_t — the first moment (exponential moving average of gradients)
  • v_t — the second moment (exponential moving average of squared gradients)

Update Rules

At step t, given gradient g_t:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

Bias correction (m and v start at 0, so early estimates are biased toward zero; dividing by 1 − βᵗ corrects this):

\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
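A quick numeric check of the bias correction (an illustrative snippet, not part of the lesson's starter code): with a constant gradient g, the raw moving average m_t stays biased toward zero for small t, while the corrected estimate recovers g at every step.

```python
beta1 = 0.9
g = 1.0  # constant gradient
m = 0.0  # first moment starts at zero

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g  # raw EMA, biased toward 0 early on
    m_hat = m / (1 - beta1 ** t)     # bias-corrected estimate
    print(f"t={t}: m={m:.3f}  m_hat={m_hat:.3f}")
# m climbs slowly (0.100, 0.190, 0.271) while m_hat is 1.000 at every step
```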

Parameter update:

\theta_t = \theta_{t-1} - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}
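Putting the three pieces together, a single scalar Adam step can be sketched like this (a hedged sketch; the function name `adam_step` and keyword defaults are my own choices, not the lesson's required signature):

```python
def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * g      # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g  # second-moment EMA
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, g=0.5, m=m, v=v, t=1)
# the first step moves theta by almost exactly lr,
# since after bias correction m_hat / sqrt(v_hat) ≈ sign(g)
```

Note the last comment: at t = 1 the bias-corrected ratio is g / |g|, so Adam's very first step has magnitude close to η regardless of the gradient's scale.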

Default Hyperparameters

The paper suggests: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸, η = 0.001.

  • β₁ = 0.9: gradient momentum decays with roughly a 10-step memory
  • β₂ = 0.999: the squared-gradient average has roughly a 1000-step memory
  • ε: prevents division by zero when the second moment is near zero
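The "10-step" and "1000-step" memories come from the usual rule of thumb that an exponential moving average with decay factor β averages over roughly 1/(1 − β) recent values (a quick check, not part of the lesson's starter code):

```python
beta1, beta2 = 0.9, 0.999
# effective averaging window of an EMA ≈ 1 / (1 - beta)
print(f"first-moment window  ≈ {1 / (1 - beta1):.0f} steps")   # ≈ 10
print(f"second-moment window ≈ {1 / (1 - beta2):.0f} steps")   # ≈ 1000
```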

Adam is the default optimizer for most deep learning — it works well across a wide range of architectures and learning rates.

Your Task

Implement adam_update(param, grad, m, v, t, lr, beta1, beta2, eps) returning (new_param, new_m, new_v).
