Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some dist-training robust cases into fluid benchmark test #11207

Merged
merged 16 commits into from
Jun 11, 2018

Conversation

velconia
Copy link
Collaborator

@velconia velconia commented Jun 5, 2018

Close #11206

velconia added 2 commits June 5, 2018 21:51
2. add learning rate decay feature into fluid benchmark test
3. add L1&L2 regularization feature into fluid benchmark test
4. add error clipping feature into fluid benchmark test
5. add gradient clipping feature into fluid benchmark test
@@ -26,6 +26,10 @@
import paddle.fluid.core as core
import paddle.fluid.framework as framework
from paddle.fluid.executor import Executor
from models.model_base import get_decay_learning_rate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model_base is not uploaded?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review, I added the benchmark/fluid/models/model_base.py file in next commit


# clone from default main program
inference_program = fluid.default_main_program().clone()

optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate)
# set gradient clip
set_gradient_clip(args.gradient_clip_method, args.gradient_clip_norm)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way that we can disable these settings if the args is empty?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if clip_method in args is None, these settings will be disabled, and if user do NOT specify the args --gradient_clip_method, the args will be None in the case of default.

the code was like below

def set_gradient_clip(clip_method, clip_norm=1.):
    if not clip_method:
        return None

@typhoonzero
Copy link
Contributor

I'm currently thinking, we can test all these cases using unit test and not CE. Run e2e tests with CE may spend alot of time

@velconia
Copy link
Collaborator Author

velconia commented Jun 8, 2018

Actually, test cases have cover the most part of these features, however, what we need is:

  1. Running an program in distributed environment, e.g., run learning decay with parallel executor in 2 parameter servers and 2 trainers which unit tests could not cover.
  2. These tests does not run to end actually, we just run few iterations each time, which will not cost much time.

@velconia
Copy link
Collaborator Author

velconia commented Jun 8, 2018

So I guess, put these features in fluid benchmark and add about 6-7 cases in ce is a good choice, which will cost around 30s each case

choices=[],
help='Error clipping method, not allowed yet')
parser.add_argument(
'--error_clip_min',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove clipping and other optimization configures in argument, It might be clean if we leave these settings to model configs, Thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

try:
# the listen_and_serv_op would touch a file which contains the listen port
# on the /tmp directory until it was ready to process all the RPC call.
os.stat("/tmp/paddle.%d.port" % pid)
return
except os.error:
retry_times -= 1
retry_times -= sleep_time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems is not for current PR? and retry_times seems is the total count for trying not the time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the name retry_times to start_left_time to indicate that this is the left time for pserver starting, and this change is for passing the CI

@typhoonzero
Copy link
Contributor

Ref #11213 for adding unit tests

Copy link
Contributor

@typhoonzero typhoonzero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@typhoonzero typhoonzero merged commit 1cfd3cb into PaddlePaddle:develop Jun 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants