
Add DeepSpeed support #82

Merged: 19 commits into huggingface:main on May 27, 2021
Conversation

@thevasudevgupta (Contributor) commented on May 13, 2021

This PR will add DeepSpeed support to Accelerate.

User's code should look like this:

from accelerate import Accelerator

model: torch.nn.Module
optimizer: torch.optim.Optimizer
lr_scheduler: torch.optim.lr_scheduler.LambdaLR
tr_data: torch.utils.data.DataLoader
eval_data: torch.utils.data.DataLoader

gradient_accumulation_steps: int
epochs: int

accelerator = Accelerator(fp16=True)
model, optimizer, tr_data, eval_data = accelerator.prepare(model, optimizer, tr_data, eval_data)

# training loop
for e in range(epochs):
    for step, batch in enumerate(tr_data):
        output = model(**batch)
        loss = output.loss / gradient_accumulation_steps
        accelerator.backward(loss)
        if (step + 1) % gradient_accumulation_steps == 0 or step == len(tr_data) - 1:
            optimizer.step()
            optimizer.zero_grad()
            if not optimizer.is_overflow:
                lr_scheduler.step()

# distributed evaluation
model.eval()
for batch in eval_data:
    output = model(**batch)
    predictions = accelerator.gather(output)
    labels = accelerator.gather(batch["labels"])

# for saving your model
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)

# if it's a HuggingFace Transformers model
unwrapped_model.save_pretrained(args.output_dir, state_dict=accelerator.get_state_dict(model), save_function=accelerator.save)
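The stepping condition used in the training loop above can be isolated into a small helper for clarity (a sketch; `should_step` is a hypothetical name, not part of Accelerate):

```python
def should_step(step: int, gradient_accumulation_steps: int, num_batches: int) -> bool:
    """Step the optimizer every `gradient_accumulation_steps` batches,
    and also on the final batch so leftover accumulated gradients are applied."""
    return (step + 1) % gradient_accumulation_steps == 0 or step == num_batches - 1
```

This mirrors the `if` in the loop exactly; the final-batch clause matters when the number of batches is not a multiple of the accumulation steps.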

Running the script

accelerate config
accelerate launch <training_script.py> <training_args>

@sgugger (Collaborator) left a comment
Thanks a lot for your work on this! This is great! I left a couple of comments to polish the implementation.

Review threads (resolved) on: src/accelerate/accelerator.py, src/accelerate/deepspeed_utils.py, src/accelerate/optimizer.py, src/accelerate/state.py
@sgugger (Collaborator) commented on May 17, 2021

Note before merging: the PR should be rebased on master, and the reference to the model added to the Accelerator should be cleaned up in the method introduced in the PR above.

@sgugger (Collaborator) left a comment

Thanks a lot for all your amazing work! I think we are in good shape to have this merged; I just left some last comments around naming.

Review threads (resolved) on: src/accelerate/accelerator.py, src/accelerate/commands/config/cluster.py, src/accelerate/commands/launch.py, src/accelerate/optimizer.py, src/run_glue_no_trainer.py
@sgugger requested a review from @LysandreJik on May 26, 2021
@LysandreJik (Member) left a comment

Cool, great job implementing that @vasudevgupta7!

Two items that should be handled in my opinion:

  • It seems that accelerate will not run once this PR is merged if deepspeed isn't in the environment. Is this correct?
  • There really should be some tests to ensure that everything is working correctly. I would advocate for a test suite that runs accelerate with and without deepspeed, but I know the infra to support that isn't there yet.
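Regarding the first point, a common pattern is an availability check so the package is only imported when it is actually requested (a sketch of the idea; Accelerate's actual guard may differ):

```python
import importlib.util

def is_deepspeed_available() -> bool:
    """Return True if the `deepspeed` package can be imported in this environment."""
    return importlib.util.find_spec("deepspeed") is not None
```

The Accelerator could then raise a clear ImportError at construction time when DeepSpeed is requested but not installed, instead of failing at import time for every user.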

Review threads (resolved) on: README.md, src/accelerate/accelerator.py, src/accelerate/deepspeed_utils.py, src/accelerate/commands/config/cluster.py
@sgugger (Collaborator) left a comment

Just a few last tweaks to the doc and this should be good to merge!

Review threads (resolved) on: README.md
@thevasudevgupta thevasudevgupta changed the title [WIP] Add DeepSpeed support Add DeepSpeed support May 27, 2021
@sgugger sgugger merged commit f1333b5 into huggingface:main May 27, 2021
4 participants