Add DeepSpeed support #82

thevasudevgupta · 2021-05-13T12:29:36Z

This PR will add DeepSpeed support to Accelerate.

User code should look like this:

import torch
from accelerate import Accelerator

model: torch.nn.Module
optimizer: torch.optim.Optimizer
tr_data: torch.utils.data.DataLoader
eval_data: torch.utils.data.DataLoader

gradient_accumulation_steps: int
epochs: int

accelerator = Accelerator(fp16=True)
model, optimizer, tr_data, eval_data = accelerator.prepare(model, optimizer, tr_data, eval_data)

# training loop
for e in range(epochs):
    for step, batch in enumerate(tr_data):
        output = model(**batch)
        loss = output.loss / gradient_accumulation_steps
        accelerator.backward(loss)
        if (step + 1) % gradient_accumulation_steps == 0 or step == len(tr_data) - 1:
            optimizer.step()
            optimizer.zero_grad()
-           lr_scheduler.step()
+           if not optimizer.is_overflow:
+               lr_scheduler.step()

# distributed evaluation
model.eval()
for batch in eval_data:
    output = model(**batch)
    predictions = accelerator.gather(output)
    labels = accelerator.gather(batch["labels"])

# for saving your model
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)

# if it's a Hugging Face Transformers model
- unwrapped_model.save_pretrained(args.output_dir, save_function=accelerator.save)
+ unwrapped_model.save_pretrained(args.output_dir, state_dict=accelerator.get_state_dict(model), save_function=accelerator.save)
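
The `is_overflow` guard in the training loop above can be illustrated with a toy sketch. With fp16, a gradient overflow makes the optimizer skip its update, so the learning-rate scheduler should not advance either. The sketch uses no torch or DeepSpeed; `ToyOptimizer`, `ToyScheduler`, and `run` are hypothetical names made up for illustration, not part of Accelerate, and a single float stands in for the gradients.

```python
import math


class ToyOptimizer:
    """Hypothetical stand-in for an fp16 optimizer that skips updates on overflow."""

    def __init__(self):
        self.applied_steps = 0
        self.is_overflow = False

    def step(self, grad):
        # An overflowed fp16 gradient shows up as inf or nan.
        self.is_overflow = not math.isfinite(grad)
        if not self.is_overflow:
            self.applied_steps += 1


class ToyScheduler:
    """Hypothetical scheduler that only counts how often it was stepped."""

    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1


def run(grads, gradient_accumulation_steps):
    optimizer, lr_scheduler = ToyOptimizer(), ToyScheduler()
    accumulated = 0.0
    for step, grad in enumerate(grads):
        accumulated += grad / gradient_accumulation_steps
        # Same condition as in the training loop: step every N batches and on the last batch.
        if (step + 1) % gradient_accumulation_steps == 0 or step == len(grads) - 1:
            optimizer.step(accumulated)
            accumulated = 0.0  # plays the role of optimizer.zero_grad()
            if not optimizer.is_overflow:
                lr_scheduler.step()
    return optimizer.applied_steps, lr_scheduler.steps
```

With five batches, an accumulation of 2, and an overflow in the third batch, the second update is skipped and the scheduler stays in sync with the number of updates actually applied.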

Running the script

accelerate config
accelerate launch <training_script.py> <training_args>

@sgugger

sgugger

Thanks a lot for your work on this! This is great! I left a couple of comments to polish the implementation.

src/accelerate/accelerator.py

src/accelerate/deepspeed_utils.py

src/accelerate/optimizer.py

src/accelerate/state.py

sgugger · 2021-05-17T13:57:47Z

Note before merging this: the PR should be rebased on master, and the reference to the model added in the Accelerator should be cleaned up in the method introduced in the PR above.

sgugger

Thanks a lot for all your amazing work! I think we are in good shape to have this merged; I just left some last comments around naming.

src/accelerate/accelerator.py

src/accelerate/commands/config/cluster.py

src/accelerate/commands/launch.py

src/accelerate/optimizer.py

src/run_glue_no_trainer.py

src/accelerate/accelerator.py

LysandreJik

Cool, great job implementing that @vasudevgupta7!

Two items that should be handled in my opinion:

It seems that accelerate will not run once this PR is merged if deepspeed isn't in the environment. Is this correct?
There really should be some tests to ensure that everything is working correctly. I would advocate for a test suite that runs accelerate with and without deepspeed, but I know the infra to support that isn't there yet.
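
One common way to address the first item is to guard the import so the package still loads when DeepSpeed is absent. The sketch below is only an illustration under that assumption, not necessarily what this PR ended up doing; `is_package_available`, `is_deepspeed_available`, and `require_deepspeed` are hypothetical helper names.

```python
import importlib.util


def is_package_available(name: str) -> bool:
    """Return True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def is_deepspeed_available() -> bool:
    # Hypothetical helper: checks for DeepSpeed without importing it.
    return is_package_available("deepspeed")


def require_deepspeed() -> None:
    """Raise a clear error only when a DeepSpeed code path is actually requested."""
    if not is_deepspeed_available():
        raise ImportError(
            "DeepSpeed is not installed; install it to use the DeepSpeed integration."
        )
```

Code paths that do not use DeepSpeed never call `require_deepspeed()`, so they keep working in an environment without it.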

README.md

src/accelerate/accelerator.py

src/accelerate/deepspeed_utils.py

src/accelerate/accelerator.py

src/accelerate/commands/config/cluster.py

sgugger

Just a few last tweaks to the doc and this should be good to merge!

README.md

Co-authored-by: Sylvain Gugger <[email protected]>

thevasudevgupta force-pushed the deepspeed branch from 354a60e to 547ec93 Compare May 13, 2021 12:37

sgugger reviewed May 14, 2021

thevasudevgupta force-pushed the deepspeed branch from 7189100 to 63555d4 Compare May 15, 2021 19:36

sgugger mentioned this pull request May 17, 2021

Add Accelerator.free_memory #89

Merged

thevasudevgupta added 6 commits May 20, 2021 17:26

add script with acclerator

7516e1e

squash

4ca24cc

save progess

c9a561a

fix some; deepspeed giving error now

c7dd587

fixed everything

877ed65

rebased

acde238

thevasudevgupta force-pushed the deepspeed branch from 63555d4 to acde238 Compare May 20, 2021 11:59

thevasudevgupta and others added 3 commits May 20, 2021 13:13

stage2 fix

264484d

fix optimizer cpu offload

ea1b77b

small fix

4b565cd

sgugger approved these changes May 26, 2021

sgugger requested a review from LysandreJik May 26, 2021 15:42

thevasudevgupta added 2 commits May 26, 2021 21:05

fix suggestions

06b2abe

update readme

54e3834

LysandreJik reviewed May 27, 2021

sgugger reviewed May 27, 2021

thevasudevgupta added 6 commits May 27, 2021 17:12

fix suggestions

f9e7db5

extract fp16 state_dict

da7fb8d

remove deepspeed dependency

d72323e

add fp16-32 conversion; readme update

73d6a59

remove run script

78475b1

make style

9e7e8aa

sgugger approved these changes May 27, 2021

thevasudevgupta changed the title ~~[WIP] Add DeepSpeed support~~ Add DeepSpeed support May 27, 2021

Apply suggestions from code review

2e03284

Co-authored-by: Sylvain Gugger <[email protected]>

make quality

e14f0f5

sgugger merged commit f1333b5 into huggingface:main May 27, 2021
