
Adam op #4403

Closed

tonyyang-svail wants to merge 2 commits

Conversation

@tonyyang-svail commented Sep 26, 2017

fix #4377

Implements the Adam gradient descent algorithm:

moment1_out = beta1 * moment1 + (1 - beta1) * grad
moment2_out = beta2 * moment2 + (1 - beta2) * grad * grad
moment1_hat = moment1_out / (1 - beta1^t)
moment2_hat = moment2_out / (1 - beta2^t)
param_out = param - learning_rate * moment1_hat / (sqrt(moment2_hat) + epsilon)
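
For concreteness, here is a minimal standalone sketch of one such update step on a plain float array. It mirrors the equations above; the function and parameter names are illustrative only and are not taken from this PR's kernel.

#include <cmath>
#include <cstddef>

// Hedged sketch of a single Adam step, following the update rules listed above.
// Not the PaddlePaddle kernel; adam_step and its parameters are illustrative.
void adam_step(float *param, const float *grad, float *moment1, float *moment2,
               std::size_t n, float learning_rate, float beta1, float beta2,
               float epsilon, int t) {
  const float bias1 = 1.0f - std::pow(beta1, static_cast<float>(t));  // 1 - beta1^t
  const float bias2 = 1.0f - std::pow(beta2, static_cast<float>(t));  // 1 - beta2^t
  for (std::size_t i = 0; i < n; ++i) {
    moment1[i] = beta1 * moment1[i] + (1.0f - beta1) * grad[i];
    moment2[i] = beta2 * moment2[i] + (1.0f - beta2) * grad[i] * grad[i];
    const float m1_hat = moment1[i] / bias1;
    const float m2_hat = moment2[i] / bias2;
    param[i] -= learning_rate * m1_hat / (std::sqrt(m2_hat) + epsilon);
  }
}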

Yang and others added 2 commits September 26, 2017 13:34
@tonyyang-svail self-assigned this Sep 26, 2017
@tonyyang-svail (Author):

Passed its own unit test.

@tonyyang-svail changed the title from "Adam op 2" to "Adam op" on Sep 27, 2017
  using framework::OperatorWithKernel::OperatorWithKernel;

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
Member:

This needs to be changed to the new InferShapeContextBase.


    PADDLE_ENFORCE_EQ(ctx.Input<Tensor>("param")->dims(),
                      ctx.Input<Tensor>("grad")->dims(),
                      "Two input of Adam Op's dimension must be same.");
Contributor:

I think we should have a better message here than the generic "Two input of Adam Op's dimension must be same." It would be good to say something like "the dimensions of the gradient and the moments should be the same."
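
A possible rewording along these lines, keeping the existing PADDLE_ENFORCE_EQ check; the message text below is only a suggestion, and the moment inputs would get analogous checks with their own messages.

    // Suggested wording; the check itself (param dims == grad dims) is unchanged.
    PADDLE_ENFORCE_EQ(ctx.Input<Tensor>("param")->dims(),
                      ctx.Input<Tensor>("grad")->dims(),
                      "The dimensions of the gradient must be equal to the "
                      "dimensions of the parameter.");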


float beta1_to_t = std::pow(beta1, t);
float beta2_to_t = std::pow(beta2, t);
auto m1_hat = m1_o / (1 - beta1_to_t);
Contributor:

We can make this computation more efficient by changing the order of the operations. Please refer to the modified update given just before Section 2.1 of the Adam paper (https://arxiv.org/abs/1412.6980).
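
For reference, the reformulation just before Section 2.1 of the paper folds both bias corrections into a single scalar step size, so the per-element divisions by (1 - beta1^t) and (1 - beta2^t) disappear. A hedged standalone sketch of that reordering follows; the names are illustrative and this is not the kernel in this PR, and epsilon here plays the role of the paper's epsilon_hat.

#include <cmath>
#include <cstddef>

// Sketch of the reordered update: alpha_t = lr * sqrt(1 - beta2^t) / (1 - beta1^t)
// is computed once per step, then applied to the uncorrected moments.
void adam_step_folded(float *param, const float *grad, float *m1, float *m2,
                      std::size_t n, float lr, float beta1, float beta2,
                      float epsilon, int t) {
  const float alpha_t =
      lr * std::sqrt(1.0f - std::pow(beta2, static_cast<float>(t))) /
      (1.0f - std::pow(beta1, static_cast<float>(t)));
  for (std::size_t i = 0; i < n; ++i) {
    m1[i] = beta1 * m1[i] + (1.0f - beta1) * grad[i];
    m2[i] = beta2 * m2[i] + (1.0f - beta2) * grad[i] * grad[i];
    param[i] -= alpha_t * m1[i] / (std::sqrt(m2[i]) + epsilon);
  }
}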

AddAttr<float>("beta1", "exponential decay for the first moment");
AddAttr<float>("beta2", "exponential decay for the second moment");
AddComment(R"DOC(

Contributor:

I think we should also cite the Adam paper here.
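
One possible way to work the citation into the existing AddComment block; the wording below is only a suggestion.

    AddComment(R"DOC(
Adam optimizer, as described in "Adam: A Method for Stochastic Optimization"
by Kingma and Ba (https://arxiv.org/abs/1412.6980).

moment1_out = beta1 * moment1 + (1 - beta1) * grad
moment2_out = beta2 * moment2 + (1 - beta2) * grad * grad
moment1_hat = moment1_out / (1 - beta1^t)
moment2_hat = moment2_out / (1 - beta2^t)
param_out = param - learning_rate * moment1_hat / (sqrt(moment2_hat) + epsilon)
)DOC");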


template <typename T, int MajorType = Eigen::RowMajor,
          typename IndexType = Eigen::DenseIndex>
using EigenScalar = framework::EigenScalar<T, MajorType, IndexType>;
Contributor:

Are we using this?

@tonyyang-svail (Author):

Reminder: should the learning rate be an attribute or a tensor variable?

@abhinavarora (Contributor):

I discussed this with @reyoung, and per his suggestion the learning rate and time step should be inputs rather than attributes. I am currently trying to figure out whether we can pass a float/int as an input instead of a tensor with shape (1,).
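
If the learning rate and time step do end up as inputs, the proto maker might declare them roughly as follows. This is a hedged sketch: the input names "learning_rate" and "time_step" are illustrative and not settled in this thread.

    // Hypothetical sketch: learning rate and time step as (1,)-shaped input
    // tensors rather than attributes. Input names are illustrative only.
    AddInput("learning_rate", "learning rate, a tensor with shape (1,)");
    AddInput("time_step", "current time step t, a tensor with shape (1,)");
    AddAttr<float>("beta1", "exponential decay for the first moment");
    AddAttr<float>("beta2", "exponential decay for the second moment");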

Successfully merging this pull request may close these issues: Adam Operator.

3 participants