feat: add convenient selection of data parallelism #335
Co-authored-by: Aaron Berdy <[email protected]>
@@ -134,6 +135,10 @@ def create(
to execute the job. Default: InstanceConfig(instanceType='ml.m5.large',
instanceCount=1, volumeSizeInGB=30).

distribution (str): A str that specifies how the job should be distributed. If set to
Should we add a note that it's intended for use with >1 instance count?
I don't think so. Data parallel distribution can also be used with a single multi-GPU instance.
Ah, I didn't realize that. In that case, might a user want to create a local job with data parallel distribution if their local hardware supports it?
I think we can keep local mode out of scope. According to this link, SageMaker local mode does not support distributed training with local GPUs.
Let's change the name from distribution to data_parallel.
The option is kept as |
Issue #, if available:
Description of changes:
Adds a parameter that populates the CreateJob hyperparameters needed to use SageMaker data parallelism.
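To illustrate the idea, here is a minimal sketch of how such a parameter might translate into hyperparameters. It is not the code from this PR: the helper name `with_data_parallel`, the accepted value `"data_parallel"`, and both hyperparameter key names are assumptions made for illustration and should be checked against the SDK and the SageMaker documentation.

```python
from typing import Dict, Optional


def with_data_parallel(
    hyperparameters: Optional[Dict[str, str]],
    instance_type: str,
    distribution: Optional[str] = None,
) -> Dict[str, str]:
    """Return a copy of the hyperparameters, extended for data parallelism.

    Hypothetical helper: the name, the signature, and the keys below are
    illustrative assumptions, not the SDK's confirmed API.
    """
    # Copy so the caller's dictionary is never mutated.
    merged = dict(hyperparameters or {})

    # No distribution requested: leave the hyperparameters untouched.
    if distribution is None:
        return merged

    # Assumed: "data_parallel" is the only supported value.
    if distribution != "data_parallel":
        raise ValueError(f"Unsupported distribution: {distribution!r}")

    # Assumed key names for enabling SageMaker distributed data parallelism.
    merged.update(
        {
            "sagemaker_distributed_dataparallel_enabled": "true",
            "sagemaker_instance_type": instance_type,
        }
    )
    return merged


# Example call with a multi-GPU instance type chosen for illustration.
params = with_data_parallel({"epochs": "5"}, "ml.p3.16xlarge", "data_parallel")
```

Returning a new dictionary rather than mutating the input keeps the user's own hyperparameters unchanged when no distribution is requested.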
Testing done:
Unit tests and integration tests.
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.
General
Tests
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.