【Hackathon 5th No.41】Add Rprop API to Paddle (RFC) #746
Conversation
WintersMontagne10335 commented Nov 10, 2023 (edited)
- 【PaddlePaddle Hackathon 5th】Open-Source Contribution Individual Challenge Paddle#57262
- `paddle.optimizer.Rprop`

``` python
paddle.optimizer.Rprop(
```
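For context, the full signature originally proposed in the RFC can be pieced together from the diff fragments quoted throughout this review; reconstructed here for readability (the per-parameter comments map the names onto the symbols from the paper, as discussed below):

``` python
paddle.optimizer.Rprop(
    delta = 0.01,        # initial step size, Delta in the paper
    delta_min = 1e-6,    # lower bound on the step size
    delta_max = 50,      # upper bound on the step size
    eta_negative = 0.5,  # multiplicative decrease factor (eta-)
    eta_positive = 1.2,  # multiplicative increase factor (eta+)
    grad_clip = None,
    name = None
)
```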
Usually, the first parameter of an optimizer is `learning_rate`. Is it not necessary here? Please explain why.
Consider unifying the parameter name with the other optimizer APIs, preferably calling it `learning_rate`, which can be specified in the user documentation as representing Delta in the paper.
Done
    delta = 0.01,
    delta_min = 1e-6,
    delta_max = 50,
Would a tuple be easier to understand here? Also, is the competitors' parameter naming good? If not, it is better to use the names from the paper.
I agree with you that a tuple is easier to understand. Is it OK to name it `delta_range`?
According to the description above, if we use `learning_rate` to represent Delta in the paper, then we can name it `learning_rate_range` or `lr_range`.
Done
    eta_negative = 0.5,
    eta_positive = 1.2,
Would a tuple be easier to understand here as well? Also, is the competitors' parameter naming good? If not, it is better to use the names from the paper.
Is it OK to name it `etas`?
`etas` is OK.
Done
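Putting the agreed renames together (`learning_rate` for `delta`, `learning_rate_range` for the min/max pair, `etas` for the eta pair), a minimal sketch of what the revised signature might look like. This only illustrates the conclusions of this thread; the final signature and defaults in the RFC may differ:

``` python
paddle.optimizer.Rprop(
    learning_rate = 0.01,              # Delta in the paper (was `delta`)
    learning_rate_range = (1e-6, 50),  # (min, max) step-size bounds (was `delta_min`, `delta_max`)
    etas = (0.5, 1.2),                 # (eta-, eta+) (was `eta_negative`, `eta_positive`)
    grad_clip = None,
    name = None
)
```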
    eta_negative = 0.5,
    eta_positive = 1.2,
    grad_clip = None,
    name = None
Do the parameters `maximize` and `differentiable` need to be supported? If not, why not?
These two parameters are not included in the paper, nor are they present in Paddle's other optimizers, so I am not planning to support them.
OK. We need to summarize here the differences between our implementation and PyTorch's, and the reasons for them (are we more in line with the paper?), including but not limited to functionality and algorithms.
I'm a bit confused. Do you mean that I should provide the reason why Paddle does not have these two parameters? Or something else?
The two parameters are one difference. Also, after roughly reviewing PyTorch's code, its implementation does not seem to match the description in the paper exactly. The RFC should confirm whether PyTorch's implementation is consistent with the paper, so that we can answer this question: if we use the algorithm from the paper, will there be differences between Paddle and PyTorch, and what are they?
Your insights are very profound.

The difference between PyTorch's implementation and the paper shows up when `grad[i] * grad[i-1] < 0`: in PyTorch, `param[i] = param[i-1]`; in the paper, `param[i] = param[i-1] + sign(grad[i-1]) * learning_rate[i-1]`, which means `param[i] = param[i-2]` (the previous update is reverted). Intuitively, because it skips those extra operations, the PyTorch method is faster, which is why I decided to adopt an implementation approach similar to PyTorch's.

But within the limited time it is difficult for me to give a rigorous answer, because theoretical analysis and experiments are very time-consuming, and I have two other PRs to complete. For the sake of the deadline, can we skip this difficult problem for now and keep moving forward? I will provide a detailed answer to this question in the next PR.
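To make the difference concrete, here is a small illustrative sketch of the two update rules for a single scalar parameter. This is simplified pseudocode written for this discussion, not Paddle's or PyTorch's actual implementation; names such as `rprop_step` and `backtrack` are made up for the example:

``` python
import numpy as np

def rprop_step(param, grad, prev_grad, step, prev_update,
               etas=(0.5, 1.2), step_range=(1e-6, 50), backtrack=True):
    # One Rprop step for a single scalar parameter (illustration only).
    if grad * prev_grad > 0:
        # Same gradient sign as last step: grow the step size (eta+).
        step = min(step * etas[1], step_range[1])
    elif grad * prev_grad < 0:
        # Gradient sign flipped: shrink the step size (eta-).
        step = max(step * etas[0], step_range[0])
        if backtrack:
            # Paper-style: revert the previous update, so param[i] == param[i-2].
            param += prev_update
        # PyTorch-style (backtrack=False): param is left unchanged, param[i] == param[i-1].
        # Both variants treat the gradient as 0 for the next sign comparison.
        return param, 0.0, step, 0.0
    update = np.sign(grad) * step
    param -= update
    return param, grad, step, update

# Gradient sequence with a sign flip on the third step:
param, grad_mem, step, upd = 1.0, 0.0, 0.01, 0.0
for g in [0.3, 0.2, -0.1]:
    param, grad_mem, step, upd = rprop_step(param, g, grad_mem, step, upd)
```

With `backtrack=False` the sign-flip branch leaves `param` untouched, which matches the intuition above that the PyTorch-style variant does less work per step.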
OK, the basic questions are quite clear now; we can move forward and reconsider this in the PR.
LGTM