Lesson 14 of 15
Adam Optimizer
Adaptive Moment Estimation
Plain gradient descent uses the same learning rate for all parameters. Adam adapts the learning rate per parameter by tracking two running statistics:
- m — the first moment (exponential moving average of gradients)
- v — the second moment (exponential moving average of squared gradients)
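As a quick sketch of what "tracking two running statistics" means, here is how the two exponential moving averages evolve over a few gradients (the gradient values are hypothetical; the beta values are the usual Adam defaults):

```python
# Exponential moving averages of the gradient (m) and of the
# squared gradient (v), as Adam tracks them.
beta1, beta2 = 0.9, 0.999
m, v = 0.0, 0.0
for g in [1.0, -0.5, 0.25]:               # hypothetical gradient sequence
    m = beta1 * m + (1 - beta1) * g       # first moment
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment
print(m, v)
```

Note that m keeps the sign of recent gradients while v is always non-negative, since it averages squares.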
Update Rules
At step t, given gradient g_t:

m_t = β₁ · m_{t−1} + (1 − β₁) · g_t
v_t = β₂ · v_{t−1} + (1 − β₂) · g_t²

Bias correction (prevents large updates at small t, when m and v start at 0):

m̂_t = m_t / (1 − β₁^t)
v̂_t = v_t / (1 − β₂^t)

Parameter update:

θ_t = θ_{t−1} − lr · m̂_t / (√v̂_t + ε)
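Putting the three pieces together, a single Adam step for one scalar parameter looks like this (a minimal sketch with hypothetical values; the default hyperparameters are from the paper):

```python
import math

# One Adam step for a single scalar parameter.
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
param, grad = 1.0, 0.5     # hypothetical parameter and gradient
m, v, t = 0.0, 0.0, 1      # fresh state: first step

m = beta1 * m + (1 - beta1) * grad       # update first moment
v = beta2 * v + (1 - beta2) * grad ** 2  # update second moment
m_hat = m / (1 - beta1 ** t)             # bias correction
v_hat = v / (1 - beta2 ** t)
param -= lr * m_hat / (math.sqrt(v_hat) + eps)
print(param)
```

At t = 1 the bias correction exactly cancels the (1 − β) factors, so m̂ = grad and v̂ = grad², and the step size is close to lr.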
Default Hyperparameters
The paper suggests: lr = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e−8.
- β₁ = 0.9: gradient momentum decays with a ~10-step memory
- β₂ = 0.999: squared gradient has a ~1000-step memory
- ε = 1e−8: prevents division by zero
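The "memory" figures above come from the fact that an exponential moving average with decay β averages over roughly 1/(1 − β) recent steps:

```python
# Effective averaging window of an EMA with decay beta is ~1 / (1 - beta).
for beta in (0.9, 0.999):
    print(beta, 1 / (1 - beta))   # ~10 steps and ~1000 steps
```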
Adam is the default optimizer for most deep learning — it works well across a wide range of architectures and learning rates.
Your Task
Implement adam_update(param, grad, m, v, t, lr, beta1, beta2, eps) returning (new_param, new_m, new_v).