
【Hackathon 5th No.41】Add Rprop API RFC for Paddle #746

Merged: 5 commits into PaddlePaddle:master from winters019 on Dec 12, 2023

Conversation

WintersMontagne10335 (Contributor) commented on Nov 10, 2023:

- `paddle.optimizer.Rprop`

``` python
paddle.optimizer.Rprop(
    delta=0.01,
    delta_min=1e-6,
    delta_max=50,
    parameters=None,      # assumed here, following Paddle's other optimizers
    eta_negative=0.5,
    eta_positive=1.2,
    grad_clip=None,
    name=None,
)
```
Contributor:
Usually, the first parameter of an optimizer is learning_rate. Is it not necessary here? Please explain why.

Contributor Author:
Good evening!
In the paper there is no traditional learning rate; delta plays a similar role, and that notation is retained here. Below is the pseudo-code of the algorithm from the paper:

[image: Rprop pseudo-code from the paper]
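For readability, here is a plain-Python transcription of that pseudo-code. This is a sketch of Rprop with weight-backtracking as described in the paper; the function and variable names are illustrative, not part of the RFC:

``` python
import numpy as np

def rprop_step(param, grad, state):
    # Per-element Rprop with weight-backtracking (Riedmiller & Braun, 1993).
    # `state` carries the previous gradient, the per-element step sizes,
    # and the previous weight change. Defaults follow the values in this RFC.
    eta_negative, eta_positive = 0.5, 1.2
    delta_min, delta_max = 1e-6, 50.0
    prev_grad, delta, prev_step = state

    sign = grad * prev_grad

    # Same sign: the last move helped, so grow the step size.
    delta = np.where(sign > 0, np.minimum(delta * eta_positive, delta_max), delta)
    # Sign flip: we overshot a minimum, so shrink the step size ...
    delta = np.where(sign < 0, np.maximum(delta * eta_negative, delta_min), delta)
    # ... and revert the previous weight change (weight-backtracking).
    step = np.where(sign < 0, -prev_step, -np.sign(grad) * delta)
    # Zeroing the gradient forces the next iteration into the "no change" branch.
    grad = np.where(sign < 0, 0.0, grad)

    param = param + step
    return param, (grad, delta, step)
```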

jeff41404 (Contributor) commented on Nov 17, 2023:
Consider unifying the parameter name with the other optimizer APIs, preferably as learning_rate; the user documentation can state that it represents delta in the paper.

Contributor Author:

Done

Comment on lines 292 to 294:

``` python
delta = 0.01,
delta_min = 1e-6,
delta_max = 50,
```
Contributor:
Would tuples be easier to understand? And is the parameter naming of competing frameworks good? If not, it is better to use the names from the paper.

Contributor Author:
I agree that tuples are easier to understand. Is it OK if we name it delta_range?

Contributor:
According to the discussion above, if we use learning_rate to represent delta in the paper, then we can name this learning_rate_range or lr_range.

Contributor Author:
Done

Comment on lines 296 to 297:

``` python
eta_negative = 0.5,
eta_positive = 1.2,
```
Contributor:
Would tuples be easier to understand here as well? And is the parameter naming of competing frameworks good? If not, it is better to use the names from the paper.

Contributor Author:
Is it OK if we name it etas?

Contributor:
etas is ok

Contributor Author:
Done
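Putting the agreed renames together, a call would look roughly like the following. This is an illustrative sketch of the signature discussed in this thread (learning_rate, learning_rate_range, etas), not the final merged API; the Linear model is just a stand-in:

``` python
import paddle

linear = paddle.nn.Linear(10, 1)  # stand-in model
opt = paddle.optimizer.Rprop(
    learning_rate=0.01,              # delta in the paper
    learning_rate_range=(1e-6, 50),  # (delta_min, delta_max)
    parameters=linear.parameters(),
    etas=(0.5, 1.2),                 # (eta_negative, eta_positive)
    grad_clip=None,
)
```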

``` python
eta_negative = 0.5,
eta_positive = 1.2,
grad_clip = None,
name = None
```
jeff41404 (Contributor) commented on Nov 13, 2023:

Do the parameters maximize and differentiable need to be supported? If not, why not?

Contributor Author:

These two parameters are not included in the paper, nor are they present in Paddle's other optimizers, so I do not plan to support them.

jeff41404 (Contributor) commented on Nov 17, 2023:

OK. We need to summarize here the differences between our implementation and PyTorch's, and the reasons for them (are we closer to the paper?), including but not limited to functionality and algorithms.

Contributor Author:

I'm a bit confused. Do you mean I should explain why Paddle does not have these two parameters, or something else?

Contributor:

The two parameters are one difference. Also, after roughly reviewing PyTorch's code, its implementation does not seem to match the description in the paper exactly. The RFC needs to confirm whether PyTorch's implementation is consistent with the paper, so that we can answer this question: if we use the algorithm from the paper, will there be differences between Paddle and PyTorch, and what are they?

Contributor Author:

Your insight is very sharp.
The difference between PyTorch's implementation and the paper is in how a gradient sign change is handled. When grad[i] * grad[i-1] < 0, PyTorch sets param[i] = param[i-1], while the paper sets param[i] = param[i-1] + sign(grad[i-1]) * learning_rate[i-1], which means param[i] = param[i-2]. Intuitively, because it skips some operations, the PyTorch approach is faster, which is why I decided to adopt a similar implementation.
But within the limited time it is hard for me to give a rigorous answer, because the theoretical analysis and experiments are very time-consuming, and I have two other PRs to complete. Given the deadline, can we skip this difficult question and keep moving forward? I will provide a detailed answer in the next PR.
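To make the difference concrete, here is a small side-by-side sketch of the two sign-change behaviors. The function names are illustrative, not taken from either codebase:

``` python
import numpy as np

def on_sign_change_paper(param, prev_grad, prev_lr):
    # Paper (weight-backtracking): revert the previous update, so the
    # parameter returns to its value from two steps ago.
    return param + np.sign(prev_grad) * prev_lr

def on_sign_change_pytorch_style(param, prev_grad, prev_lr):
    # PyTorch-style: leave the parameter where it is; only the step size
    # is shrunk and the gradient is treated as zero for this element.
    return param
```

In both variants the per-element step size is still multiplied by eta_negative afterwards; they differ only in whether the previous weight change is undone.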

Contributor:

OK, the basic questions are quite clear now. We can move forward and revisit this in the PR.

jeff41404 (Contributor) left a review:

LGTM

jeff41404 merged commit d1b39c0 into PaddlePaddle:master on Dec 12, 2023.
WintersMontagne10335 deleted the winters019 branch on December 15, 2023.