Adam op #4403
Conversation
moment1_out = beta1 * moment1 + (1 − beta1) * grad
moment2_out = beta2 * moment2 + (1 − beta2) * grad * grad
moment1_hat = moment1_out / (1 - beta1^t)
moment2_hat = moment2_out / (1 - beta2^t)
param_out = param - learning_rate * moment1_hat / (sqrt(moment2_hat) + epsilon)

Passed its own unit test.
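For reference, a minimal scalar sketch that follows these update equations verbatim (the function and variable names are illustrative, not taken from this PR; the operator itself works on tensors):

#include <cmath>

// One scalar Adam step, following the update equations above verbatim.
// m1 and m2 are the running first/second moment estimates; t is the
// 1-based timestep.
float AdamStep(float param, float grad, float &m1, float &m2, float lr,
               float beta1, float beta2, float epsilon, int t) {
  m1 = beta1 * m1 + (1 - beta1) * grad;
  m2 = beta2 * m2 + (1 - beta2) * grad * grad;
  float m1_hat = m1 / (1 - std::pow(beta1, t));  // bias-corrected 1st moment
  float m2_hat = m2 / (1 - std::pow(beta2, t));  // bias-corrected 2nd moment
  return param - lr * m1_hat / (std::sqrt(m2_hat) + epsilon);
}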
using framework::OperatorWithKernel::OperatorWithKernel;

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
This needs to change to the new InferShapeContextBase.
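A rough sketch of what that change could look like, assuming the newer context type exposes GetInputDim/SetOutputDim-style accessors (the accessor and input/output names here are assumptions, not taken from this PR):

// Hypothetical sketch only: accessor and input/output names are assumed.
void InferShape(framework::InferShapeContextBase *ctx) const override {
  auto param_dims = ctx->GetInputDim("param");
  PADDLE_ENFORCE_EQ(param_dims, ctx->GetInputDim("grad"),
                    "param and grad must have the same dimensions");
  ctx->SetOutputDim("param_out", param_dims);
}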
    PADDLE_ENFORCE_EQ(ctx.Input<Tensor>("param")->dims(),
                      ctx.Input<Tensor>("grad")->dims(),
                      "Two input of Adam Op's dimension must be same.");
I think we should have a better error message here than the generic "Two input of Adam Op's dimension must be same." It would be good to have something like: the dimensions of the gradient and the moments should be the same.
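For example, one possible wording along those lines (the message text is only a suggestion):

    PADDLE_ENFORCE_EQ(ctx.Input<Tensor>("param")->dims(),
                      ctx.Input<Tensor>("grad")->dims(),
                      "The dimensions of the gradient must be the same as "
                      "the dimensions of the parameter and the moment "
                      "estimates.");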
    float beta1_to_t = std::pow(beta1, t);
    float beta2_to_t = std::pow(beta2, t);
    auto m1_hat = m1_o / (1 - beta1_to_t);
We can increase the efficiency of this computation by changing the order of the operations. Please refer to the modified computation just before Section 2.1 of the Adam paper (https://arxiv.org/abs/1412.6980).
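For reference, a scalar sketch of that reordering (names are illustrative, not the ones in this PR): the bias corrections are folded into an effective step size, so the bias-corrected moments never need to be materialized.

#include <cmath>

// Reordered Adam step as suggested just before Section 2.1 of the paper.
// Note that epsilon is added to sqrt(m2) directly here, which differs
// slightly from adding it to the bias-corrected sqrt(m2_hat).
float AdamStepReordered(float param, float grad, float &m1, float &m2,
                        float lr, float beta1, float beta2, float epsilon,
                        int t) {
  m1 = beta1 * m1 + (1 - beta1) * grad;
  m2 = beta2 * m2 + (1 - beta2) * grad * grad;
  // Effective learning rate with both bias corrections folded in.
  float lr_t = lr * std::sqrt(1 - std::pow(beta2, t)) / (1 - std::pow(beta1, t));
  return param - lr_t * m1 / (std::sqrt(m2) + epsilon);
}

This saves per-element work, since lr_t depends only on the scalar hyperparameters and the timestep rather than on every element of the moment tensors.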
AddAttr<float>("beta1", "exponential decay for the first moment"); | ||
AddAttr<float>("beta2", "exponential decay for the second moment"); | ||
AddComment(R"DOC( | ||
|
I think we should also cite the Adam Paper here.
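For instance, the DOC block could include the reference directly (the wording here is only a suggestion):

    AddComment(R"DOC(
Adam optimizer.

Computes adaptive per-parameter learning rates from estimates of the first
and second moments of the gradients.

Reference: Kingma, D. P. and Ba, J. L., "Adam: A Method for Stochastic
Optimization", ICLR 2015. https://arxiv.org/abs/1412.6980
)DOC");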
template <typename T, int MajorType = Eigen::RowMajor,
          typename IndexType = Eigen::DenseIndex>
using EigenScalar = framework::EigenScalar<T, MajorType, IndexType>;
Are we using this?
Reminder: should the learning rate be an attribute or a tensor variable?
I discussed this with @reyoung, and as per his suggestion, the learning rate and time step should be inputs and not attributes. I am currently trying to figure out whether we can pass a float/int as an input instead of a tensor with shape (1,).
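If the learning rate does end up as a shape-(1,) tensor input, the kernel could read the scalar out of it along these lines (a sketch; the input name and exact accessor usage are assumptions):

    // Hypothetical: read a scalar learning rate from a shape-(1,) tensor input.
    auto *lr_tensor = ctx.Input<framework::Tensor>("learning_rate");
    float lr = lr_tensor->data<float>()[0];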
fix #4377