- Adadelta is designed to address two main drawbacks of Adagrad:
1. the continual decay of learning rates throughout training.
2. the need for a manually selected global learning rate. (Zeiler 2012)
-
Instead of relying on a single global learning rate applied to all parameter dimensions, Adadelta uses the amount of change in the parameters themselves to calibrate the step size (cf. eq.11.8.1 of §11.8 RMSProp).
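For contrast, a sketch of the RMSProp update of §11.8 (γ is the smoothing factor, ε a small constant for numerical stability); note the hand-picked global learning rate η that Adadelta removes:
\begin{align*}
s_t &= \gamma\, s_{t-1} + (1 - \gamma)\, g_t^2 \\               % leaky 2nd moment of the gradients
w_t &= w_{t-1} - \frac{\eta}{\sqrt{s_t + \epsilon}} \odot g_t   % step still scaled by the global \eta
\end{align*}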
-
Adadelta uses leaky averages to keep running estimates of the state statistics, with the hyperparameter ρ controlling, at each update iteration, how much the history of past steps contributes relative to the current step.
-
The state variable s_t stores a weighted history of the 2nd moment of the gradients, and Δw_t stores a weighted history of the 2nd moment of the changes in the parameters.
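Written out, the per-iteration updates can be sketched as follows (LaTeX notation, following Zeiler 2012; ε is a small constant added for numerical stability):
\begin{align*}
s_t        &= \rho\, s_{t-1} + (1 - \rho)\, g_t^2 \\                               % leaky 2nd moment of the gradients
g_t'       &= \sqrt{\frac{\Delta w_{t-1} + \epsilon}{s_t + \epsilon}} \odot g_t \\ % rescaled gradient
w_t        &= w_{t-1} - g_t' \\                                                    % update, no global learning rate
\Delta w_t &= \rho\, \Delta w_{t-1} + (1 - \rho)\, (g_t')^2                        % leaky 2nd moment of the updates
\end{align*}
-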
The intuition behind the design of Adadelta is inspired by Newton's method (see §11.3).
- In Newton's method, the update at each iteration is Δw ∝ g/H, which naturally has the same units as w.
- In Adadelta, the update Δw is likewise designed to have the same units as w: the factor sqrt(Δw_{t-1} + ε)/sqrt(s_t + ε) multiplying the gradient carries units of w/g (see the sketch below).
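To make the two state variables concrete, here is a minimal from-scratch sketch in PyTorch (the function name adadelta_update and the state dict layout are made up for illustration; rho=0.9 and eps=1e-6 mirror common defaults):

import torch

def adadelta_update(w, grad, state, rho=0.9, eps=1e-6):
    # One Adadelta step on the parameter tensor w (updated in place).
    # state['s']     : leaky average of squared gradients      (s_t)
    # state['delta'] : leaky average of squared rescaled steps (Δw_t)
    s, delta = state['s'], state['delta']
    # s_t = ρ s_{t-1} + (1 - ρ) g_t²
    s.mul_(rho).addcmul_(grad, grad, value=1 - rho)
    # g'_t = sqrt((Δw_{t-1} + ε) / (s_t + ε)) * g_t, the rescaled gradient (units of w)
    g_prime = grad * torch.sqrt((delta + eps) / (s + eps))
    # w_t = w_{t-1} - g'_t  (no global learning rate needed)
    w.sub_(g_prime)
    # Δw_t = ρ Δw_{t-1} + (1 - ρ) (g'_t)²
    delta.mul_(rho).addcmul_(g_prime, g_prime, value=1 - rho)

# toy example: one update on a 3-dimensional parameter
w = torch.zeros(3)
state = {'s': torch.zeros_like(w), 'delta': torch.zeros_like(w)}
adadelta_update(w, grad=torch.ones(3), state=state)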
-
Implementation in PyTorch:
torch.optim.Adadelta(params, lr=1.0, rho=0.9)
- Note that PyTorch additionally exposes a learning rate η to scale the update step of eq.11.9.2, i.e. w_t = w_{t-1} - η g'_t. The default learning rate is 1.0, which recovers the parameter-free update above.
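A minimal usage sketch (the linear model and the random batch are placeholders for illustration):

import torch

model = torch.nn.Linear(10, 1)                    # placeholder model
opt = torch.optim.Adadelta(model.parameters(), lr=1.0, rho=0.9, eps=1e-6)

x, y = torch.randn(32, 10), torch.randn(32, 1)    # placeholder batch
loss = torch.nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()                                        # applies w_t = w_{t-1} - lr * g'_t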
- Reference: Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701.