SageMaker DP Support #494

pacman100 · 2022-07-07T15:47:00Z

What does this PR do?

Fixes bugs in the current SageMaker support so that it works properly with the latest Hugging Face Deep Learning Container (HF DLC).
Adds DATA_PARALLEL support. Run the accelerate config command, answer the prompts, and choose DATA_PARALLEL for the SageMaker distribution type. A sample config is shown below; the XXXXX values are specific to your AWS account.

base_job_name: accelerate-sagemaker-1
compute_environment: AMAZON_SAGEMAKER
distributed_type: DATA_PARALLEL
ec2_instance_type: ml.p3.16xlarge
iam_role_name: XXXXX
mixed_precision: fp16
num_machines: 1
profile: XXXXX
py_version: py38
pytorch_version: 1.10.2
region: us-east-1
transformers_version: 4.17.0
use_cpu: false
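
For readers unfamiliar with how such a config relates to a SageMaker training job, here is a minimal sketch of how the fields above might be mapped to keyword arguments for the SageMaker SDK's HuggingFace estimator. The helper name build_estimator_kwargs is hypothetical and is not taken from this PR; the actual mapping inside Accelerate may differ. The distribution dictionary follows the format the SageMaker SDK documents for its data parallel library. The sketch only builds a plain dictionary, so it does not import the SageMaker SDK or need AWS credentials.

```python
from typing import Any, Dict


def build_estimator_kwargs(config: Dict[str, Any], entry_point: str) -> Dict[str, Any]:
    """Illustrative only: map an Accelerate-style SageMaker config to estimator kwargs.

    This helper is hypothetical; it is not the code added in this PR.
    """
    kwargs: Dict[str, Any] = {
        "entry_point": entry_point,
        "role": config["iam_role_name"],
        "instance_type": config["ec2_instance_type"],
        "instance_count": config["num_machines"],
        "transformers_version": config["transformers_version"],
        "pytorch_version": config["pytorch_version"],
        "py_version": config["py_version"],
        "base_job_name": config["base_job_name"],
    }
    # Enable SageMaker's distributed data parallel library only when requested.
    if config.get("distributed_type") == "DATA_PARALLEL":
        kwargs["distribution"] = {"smdistributed": {"dataparallel": {"enabled": True}}}
    return kwargs


# The sample config from this PR, with placeholder account-specific values.
sample_config = {
    "base_job_name": "accelerate-sagemaker-1",
    "compute_environment": "AMAZON_SAGEMAKER",
    "distributed_type": "DATA_PARALLEL",
    "ec2_instance_type": "ml.p3.16xlarge",
    "iam_role_name": "XXXXX",
    "mixed_precision": "fp16",
    "num_machines": 1,
    "profile": "XXXXX",
    "py_version": "py38",
    "pytorch_version": "1.10.2",
    "region": "us-east-1",
    "transformers_version": "4.17.0",
    "use_cpu": False,
}

estimator_kwargs = build_estimator_kwargs(sample_config, "complete_nlp_example.py")
print(estimator_kwargs["distribution"])
```

Under these assumptions, the resulting dictionary could be passed as HuggingFace(**estimator_kwargs) from sagemaker.huggingface; fields such as profile and region would instead configure the AWS session.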

With this config in place, run the following commands to launch the official NLP example:

cd accelerate/examples
accelerate launch complete_nlp_example.py

The output logs:

HuggingFaceDocBuilderDev · 2022-07-07T15:49:50Z

The documentation is not available anymore as the PR was closed or merged.

sgugger

Looking good at first glance! Let's focus on DataParallel only as AWS told us they were completely changing the API for model parallelism.

sgugger

Thanks for working on this! Let's also see what @philschmid thinks since he worked on the sagemaker command in Accelerate.

src/accelerate/state.py

plamb-viso · 2022-07-07T21:30:59Z

I will test this with my use case as soon as it gets merged into main

philschmid

LGTM! How should we maintain the SAGEMAKER_PYTHON_VERSION, SAGEMAKER_PYTORCH_VERSION, SAGEMAKER_TRANSFORMERS_VERSION versions once we release new DLCs?

pacman100 · 2022-07-08T18:43:08Z

> LGTM! How should we maintain the SAGEMAKER_PYTHON_VERSION, SAGEMAKER_PYTORCH_VERSION, SAGEMAKER_TRANSFORMERS_VERSION versions once we release new DLCs?

Hello, since this does not change often, it can be updated manually for the time being whenever a new DLC is released. Any suggestions or best practices for automating it?

HebaGamalElDin · 2022-09-21T06:46:26Z

Hello @pacman100, which SageMaker SDK estimator is supposed to be used with Accelerate?

SageMaker DP and MP Support

ce3f933

sgugger reviewed Jul 7, 2022

pacman100 changed the title ~~SageMaker DP and MP Support~~ SageMaker DP Support Jul 7, 2022

fix 😅

a094c6c

pacman100 mentioned this pull request Jul 7, 2022

Cannot run distributed training on sagemaker #492

Closed

pacman100 marked this pull request as ready for review July 7, 2022 20:46

pacman100 requested a review from sgugger July 7, 2022 20:47

sgugger approved these changes Jul 7, 2022

src/accelerate/state.py (outdated, resolved)

pacman100 requested a review from philschmid July 7, 2022 20:59

removing SageMaker MP option

6ef2d6c

philschmid approved these changes Jul 8, 2022

pacman100 merged commit a0514dd into huggingface:main Jul 8, 2022
