  • Adam is a popular optimization algorithm in deep learning; it essentially combines momentum and RMSprop into a single efficient learning algorithm.

Adam algorithm

$$\mathbf{v}_t \leftarrow \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1)\, \mathbf{g}_t,$$
$$\mathbf{s}_t \leftarrow \beta_2 \mathbf{s}_{t-1} + (1 - \beta_2)\, \mathbf{g}_t^2,$$
$$\mathbf{x}_t \leftarrow \mathbf{x}_{t-1} - \frac{\eta\, \hat{\mathbf{v}}_t}{\sqrt{\hat{\mathbf{s}}_t} + \epsilon},$$

where $\mathbf{g}_t$ is the gradient at step $t$, $\epsilon$ is a small constant for numerical stability, and $\hat{\mathbf{v}}_t$, $\hat{\mathbf{s}}_t$ are the bias-corrected estimates defined below.

  • Common choices are β1 = 0.9 and β2 = 0.999. Because β2 is much closer to 1 than β1, the momentum term v_t adapts more quickly to local changes in the gradient, whereas the state vector s_t changes more smoothly, averaging over a larger window of past steps.

  • The explicit learning rate η provides extra freedom to adjust the step length during training.

  • Nice properties of Adam: the numerator contains the momentum term, which accelerates learning, while the denominator uses the state vector (squared-gradient information accumulated over a window of time) to even out the progress made in each dimension; see the sketch below.
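
To make the update rule concrete, here is a minimal from-scratch sketch of one Adam step in NumPy. The function name `adam_step` and its arguments are illustrative choices, not part of the original notes; the defaults follow the common values quoted above.

```python
import numpy as np

def adam_step(x, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of parameters x given gradient grad at step t (t starts at 1)."""
    # Momentum: exponentially weighted average of past gradients (numerator).
    v = beta1 * v + (1 - beta1) * grad
    # State vector: exponentially weighted average of squared gradients (denominator).
    s = beta2 * s + (1 - beta2) * grad ** 2
    # Bias correction removes the effect of the zero initialization of v and s.
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Momentum accelerates progress; the per-coordinate denominator evens out
    # the step length in each dimension.
    x = x - lr * v_hat / (np.sqrt(s_hat) + eps)
    return x, v, s

# Toy usage: minimize f(x) = sum(x**2), whose gradient is 2 * x.
x = np.array([1.0, -2.0])
v, s = np.zeros_like(x), np.zeros_like(x)
for t in range(1, 2001):
    x, v, s = adam_step(x, 2 * x, v, s, t, lr=0.01)
print(x)  # close to the minimum at [0, 0]
```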

Bias correction

$$\hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_1^t}, \qquad \hat{\mathbf{s}}_t = \frac{\mathbf{s}_t}{1 - \beta_2^t}.$$

  • The main reason to include these bias-correction terms is to remove the effect of the initialization v_0 = 0 and s_0 = 0 during the early iterations.

  • When t is large, β1^t and β2^t vanish, so the correction factors approach 1 and have little effect; the correction mainly matters during the first few iterations.

  • Implementation in PyTorch: `torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999))`; a minimal usage sketch follows below.
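
As a usage sketch in PyTorch (the toy linear model and random data below are placeholders for illustration, not from the notes):

```python
import torch

# Placeholder model and data: fit a linear layer to random targets.
model = torch.nn.Linear(10, 1)
inputs = torch.randn(64, 10)
targets = torch.randn(64, 1)
loss_fn = torch.nn.MSELoss()

# Adam with the common defaults discussed above.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

for step in range(100):
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = loss_fn(model(inputs), targets)
    loss.backward()                  # compute gradients
    optimizer.step()                 # Adam update: momentum + adaptive scaling + bias correction
```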

Reference

  • Adam: A Method for Stochastic Optimization -- Kingma & Ba, 2014 (arXiv:1412.6980)

  • Adaptive Methods for Nonconvex Optimization -- Zaheer et al., 2018 (NeurIPS 2018)