feat: add convenient selection of data parallelism #335

Merged
shpface merged 3 commits into local-sim-jobs from local-sim-jobs-ddp
Apr 26, 2022

@shpface commented Apr 19, 2022

Issue #, if available:

Description of changes:
Adds a distribution parameter that makes it easy to populate the CreateJob hyperparameters needed for SageMaker data parallelism.

Testing done:
unit tests, integ tests

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have checked that my tests are not configured for a specific region or account (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@shpface requested a review from a team as a code owner April 19, 2022 23:34
@shpface requested a review from ajberdy April 25, 2022 17:13
Co-authored-by: Aaron Berdy
@@ -134,6 +135,10 @@ def create(
to execute the job. Default: InstanceConfig(instanceType='ml.m5.large',
instanceCount=1, volumeSizeInGB=30).

distribution (str): A str that specifies how the job should be distributed. If set to
Contributor

Should we add a note that it's intended for use with >1 instance count?

Contributor Author

I don't think so. Data parallel distribution can also be used with a single multi-GPU instance.

Contributor

Ah, I didn't realize that. In that case, is there any potential a user may want to create a local job with data parallel distribution if their local hardware supports it?

Contributor

I think we can leave local mode out of scope. According to this link, SageMaker local mode does not support distributed training with local GPUs.

Contributor

@christianbmadsen left a comment

Let's change the name from distribution to data_parallel

@shpface merged commit 4ed61b3 into local-sim-jobs Apr 26, 2022
@shpface deleted the local-sim-jobs-ddp branch April 26, 2022 21:36

Contributor Author

shpface commented Apr 26, 2022

> Let's change the name from distribution to data_parallel

The option is kept as distribution: str rather than data_parallel: bool to leave the door open to future distribution methods.
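
A minimal sketch of this design choice, not the code merged in this PR: a string-valued distribution option can be looked up in a table of hyperparameter sets, so supporting another distribution method later only needs a new table entry rather than another boolean flag. The function name apply_distribution, the table _DISTRIBUTION_HYPERPARAMETERS, the value "data_parallel", and the hyperparameter key shown are all assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical table: each supported `distribution` value maps to the extra
# hyperparameters it implies. The value and key below are assumptions, not
# taken from the PR diff.
_DISTRIBUTION_HYPERPARAMETERS: Dict[str, Dict[str, str]] = {
    "data_parallel": {"sagemaker_distributed_dataparallel_enabled": "true"},
}


def apply_distribution(
    hyperparameters: Optional[Dict[str, str]],
    distribution: Optional[str] = None,
) -> Dict[str, str]:
    """Return a copy of `hyperparameters` with distribution settings merged in."""
    merged = dict(hyperparameters or {})  # copy so the caller's dict is untouched
    if distribution is None:
        return merged  # no distribution requested: nothing to add
    if distribution not in _DISTRIBUTION_HYPERPARAMETERS:
        supported = ", ".join(sorted(_DISTRIBUTION_HYPERPARAMETERS))
        raise ValueError(
            f"Unsupported distribution {distribution!r}; expected one of: {supported}"
        )
    merged.update(_DISTRIBUTION_HYPERPARAMETERS[distribution])
    return merged
```

With a bool flag, each new method would need its own parameter and its own check for conflicting flags; with a string, an unsupported value fails in one place with a single error message.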

4 participants