LaProp-Optimizer

Code accompanying the paper "LaProp: Separating Momentum and Adaptivity in Adam"

Use

This implementation is based on PyTorch. The LaProp optimizer is the class LaProp in the file laprop.py, adapted from PyTorch's standard optimizer optim.Adam. laprop.LaProp uses the same calling signature as the standard optim.Adam, with one additional optional argument, centered = False, which controls whether to compute the centered second moment instead of the plain squared gradient. The input argument betas corresponds to the hyperparameter tuple used in our paper.
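
For example, a minimal usage sketch (the hyperparameter values are illustrative rather than recommendations from the paper, and it is assumed that laprop.py is importable from the working directory):

    import torch
    import laprop

    # A toy model, for illustration only.
    model = torch.nn.Linear(10, 1)

    # Same calling signature as torch.optim.Adam, plus the optional centered flag.
    optimizer = laprop.LaProp(
        model.parameters(),
        lr=1e-3,
        betas=(0.9, 0.999),   # the hyperparameter tuple discussed in the paper
        weight_decay=1e-2,
        centered=False,
    )

    for _ in range(100):
        optimizer.zero_grad()
        loss = model(torch.randn(8, 10)).pow(2).mean()
        loss.backward()
        optimizer.step()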

The learning rate and the weight decay are decoupled in laprop.LaProp; therefore, when one wants to apply a learning-rate schedule together with weight decay, one needs to decay 'lr' and 'weight_decay' simultaneously in the optimizer's parameter groups.
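
A sketch of what this means in practice, assuming the model and optimizer constructed above and an illustrative linear decay (the schedule itself is not prescribed by the paper):

    base_lr, base_wd, total_steps = 1e-3, 1e-2, 10000

    for step in range(total_steps):
        scale = 1.0 - step / total_steps  # illustrative linear decay to zero
        for group in optimizer.param_groups:
            # Because the learning rate and the weight decay are decoupled,
            # both entries have to be scheduled together; scaling only 'lr'
            # would leave the weight decay at its full strength.
            group['lr'] = base_lr * scale
            group['weight_decay'] = base_wd * scale

        optimizer.zero_grad()
        loss = model(torch.randn(8, 10)).pow(2).mean()
        loss.backward()
        optimizer.step()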

When centered is enabled, the optimizer performs the first self.steps_before_using_centered = 10 updates in the non-centered way to accumulate information about the gradient, and after that it switches to the centered strategy. The number of non-centered warm-up steps is tentatively set to 10 at initialization.
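
For instance (illustrative only):

    # With centered=True, the first steps_before_using_centered = 10 updates
    # use the non-centered rule while gradient statistics accumulate; after
    # that the centered second moment is used.
    optimizer = laprop.LaProp(model.parameters(), lr=1e-3, centered=True)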

Additional Details Compared with the Paper

In laprop.LaProp, we combine the learning rate and the accumulated momentum into a single term, so that when the learning rate changes, momentum accumulated under a larger learning rate still has a larger effect.

The bias-correction terms are treated similarly. In particular, the momentum bias correction is computed from the learning rate and the momentum hyperparameter at each step, so that the correction remains valid in the presence of a changing learning rate and momentum parameter; the bias correction for the squared gradient is computed only from the beta2 hyperparameter at each step and does not involve the learning rate.
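
The following is a schematic, single-tensor reconstruction of the update written from the description above; it is not an excerpt from laprop.py, and details such as the epsilon placement and the exact weight-decay form are assumptions:

    import torch

    def laprop_step(param, grad, state, lr, betas=(0.9, 0.999),
                    eps=1e-8, weight_decay=0.0):
        beta1, beta2 = betas
        state['t'] += 1
        t = state['t']

        # Exponential moving average of the squared gradient; its bias
        # correction involves only beta2, never the learning rate.
        state['n'].mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
        n_hat = state['n'] / (1 - beta2 ** t)

        # The learning rate is folded into the accumulated momentum, so momentum
        # gathered under a larger learning rate keeps a proportionally larger effect.
        state['m'].mul_(beta1).add_(lr * grad / (n_hat.sqrt() + eps), alpha=1 - beta1)

        # Momentum bias correction accumulated from the learning rate and beta1
        # at each step, so it stays valid when lr and beta1 change over time.
        state['c'] = beta1 * state['c'] + (1 - beta1) * lr

        # Decoupled weight decay, applied directly to the parameters.
        param.add_(param, alpha=-lr * weight_decay)
        param.add_(state['m'] / state['c'], alpha=-1)

    # Example call on a single parameter tensor:
    p = torch.zeros(3)
    state = {'m': torch.zeros(3), 'n': torch.zeros(3), 'c': 0.0, 't': 0}
    laprop_step(p, torch.randn(3), state, lr=1e-3)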

Future Work

We will add the code for all the numerical experiments described in our paper.

Citation

If you use LaProp in your research, please cite our paper with the following BibTeX entry:

@article{ziyin2020laprop,
  title={LaProp: a Better Way to Combine Momentum with Adaptive Gradient},
  author={Ziyin, Liu and Wang, Zhikang T and Ueda, Masahito},
  journal={arXiv preprint arXiv:2002.04839},
  year={2020}
}
