-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add some dist-training robust cases into fluid benchmark test #11207
Conversation
2. add learning rate decay feature into fluid benchmark test 3. add L1&L2 regularization feature into fluid benchmark test 4. add error clipping feature into fluid benchmark test 5. add gradient clipping feature into fluid benchmark test
@@ -26,6 +26,10 @@ | |||
import paddle.fluid.core as core | |||
import paddle.fluid.framework as framework | |||
from paddle.fluid.executor import Executor | |||
from models.model_base import get_decay_learning_rate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
model_base is not uploaded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for review, I added the benchmark/fluid/models/model_base.py file in next commit
|
||
# clone from default main program | ||
inference_program = fluid.default_main_program().clone() | ||
|
||
optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate) | ||
# set gradient clip | ||
set_gradient_clip(args.gradient_clip_method, args.gradient_clip_norm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way that we can disable these settings if the args is empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if clip_method in args is None, these settings will be disabled, and if user do NOT specify the args --gradient_clip_method
, the args will be None in the case of default.
the code was like below
def set_gradient_clip(clip_method, clip_norm=1.):
if not clip_method:
return None
I'm currently thinking, we can test all these cases using unit test and not CE. Run e2e tests with CE may spend alot of time |
Actually, test cases have cover the most part of these features, however, what we need is:
|
So I guess, put these features in fluid benchmark and add about 6-7 cases in ce is a good choice, which will cost around 30s each case |
2. remove lr_decay, regularization, clipping out of fluid_benchmark.py
benchmark/fluid/args.py
Outdated
choices=[], | ||
help='Error clipping method, not allowed yet') | ||
parser.add_argument( | ||
'--error_clip_min', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove clipping and other optimization configures in argument, It might be clean if we leave these settings to model configs, Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
2. fix bug in test_listen_and_serv_op
try: | ||
# the listen_and_serv_op would touch a file which contains the listen port | ||
# on the /tmp directory until it was ready to process all the RPC call. | ||
os.stat("/tmp/paddle.%d.port" % pid) | ||
return | ||
except os.error: | ||
retry_times -= 1 | ||
retry_times -= sleep_time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems is not for current PR? and retry_times
seems is the total count for trying not the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the name retry_times
to start_left_time
to indicate that this is the left time for pserver starting, and this change is for passing the CI
Ref #11213 for adding unit tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Close #11206