From the original paper, I do not see that the gradient needs to be normalized.
If we use the bias-corrected estimates from Adam ("Adam: A Method for Stochastic Optimization"), they should be $\hat{m}_t = m_t / (1-\beta_1^t)$ and $\hat{v}_t = v_t / (1-\beta_2^t)$.
When $t>1$, $1-\beta_1^t \neq 1-\beta_1 = 0.1$ (the two are equal only at $t=1$, for $\beta_1 = 0.9$).
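For reference, a minimal sketch of the bias correction I mean (the hyperparameter values and variable names here are illustrative, not taken from main.py):

```python
import torch

# Illustrative Adam-style moment updates with bias correction,
# assuming beta1 = 0.9 and beta2 = 0.99 as example values.
beta1, beta2 = 0.9, 0.99
m = torch.zeros(3)
v = torch.zeros(3)

for t in range(1, 4):
    grad = torch.randn(3)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction divides by 1 - beta**t, which changes with t;
    # for beta1 = 0.9 it equals 0.1 only at t = 1.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    print(t, 1 - beta1 ** t, 1 - beta2 ** t)
```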
Thanks for your solid and insightful paper.
Lines 377-381 in main.py.
Yet the pseudocode in the original paper "Adaptive Federated Optimization" is
$x_{t+1} = x_{t} + \eta_g \frac{m_t}{\sqrt{v_t}+\tau}$
So maybe `torch.sqrt(delta + epsilon_fedadagrad)` should be changed to `torch.sqrt(delta) + epsilon_fedadagrad`.
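To illustrate the difference, here is a minimal sketch of the two server updates (variable names follow the snippet above and the paper's notation; the tensor values are dummies, not taken from main.py):

```python
import torch

eta_g = 1.0                 # server learning rate
epsilon_fedadagrad = 1e-3   # adaptivity / numerical-stability term (tau in the paper)

x = torch.zeros(3)          # server model x_t
m = torch.randn(3)          # aggregated client update m_t
delta = torch.rand(3)       # accumulated squared updates v_t

# Current code: epsilon placed inside the square root.
x_inside = x + eta_g * m / torch.sqrt(delta + epsilon_fedadagrad)

# Paper's pseudocode: epsilon added after the square root.
x_outside = x + eta_g * m / (torch.sqrt(delta) + epsilon_fedadagrad)
```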
Since I am not familiar with the Adagrad algorithm, I am not sure about this. Could you kindly help me with this issue?