
Distributed optimizer support for BERT #5305

Merged 2 commits into NVIDIA:main on Nov 14, 2022

Conversation

timmoon10 (Collaborator) commented on Nov 2, 2022

What does this PR do?

Add support for the distributed Adam optimizer to BERT.

Collection: NLP

Changelog

  • Add support for the Apex distributed Adam optimizer to BERT

Usage

Set the optimizer to distributed_fused_adam in the config file:
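
For example (a minimal sketch, assuming the standard NeMo Hydra-style config layout where the optimizer settings live under `model.optim`; all fields other than `name` are illustrative placeholders):

```yaml
model:
  optim:
    # Select the Apex distributed Adam (ZeRO-style) optimizer.
    name: distributed_fused_adam
    # Illustrative hyperparameters; keep whatever values your existing config uses.
    lr: 2e-4
    weight_decay: 0.01
    betas: [0.9, 0.98]
```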

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Pinging @shanmugamr1992.

Additional Information

timmoon10 (Collaborator, Author) commented on Nov 2, 2022

Running on 4 A40s for 20 steps (in BF16 with O2 optimizations):

| ZeRO | Data parallel size | Tensor parallel size | Pipeline parallel size | Throughput | Train loss | Val loss | Allocated memory | Reserved memory |
|---|---|---|---|---|---|---|---|---|
| No | 4 | 1 | 1 | 3.69it/s | 10.2 | 9.680 | 3855 MB | 4002 MB |
| Yes | 4 | 1 | 1 | 3.79it/s | 10.2 | 9.680 | 2997 MB | 3134 MB |
| No | 2 | 2 | 1 | 2.99it/s | 10.2 | 9.720 | 2054 MB | 2374 MB |
| Yes | 2 | 2 | 1 | 3.00it/s | 10.2 | 9.710 | 1864 MB | 2190 MB |
| No | 2 | 1 | 2 | 2.42it/s | 10.2 | 9.700 | 2561 MB | 2580 MB |
| Yes | 2 | 1 | 2 | 2.43it/s | 10.2 | 9.700 | 2224 MB | 2540 MB |

The memory savings are actually larger than I'd expect: the model has 55.4M parameters, so I'd expect ZeRO with 4-way data parallelism to save about 55.4M × (2 FP32 optimizer states + 1 16-bit main param) × (1 - 1/data_parallel_size) ≈ 416 MB. The distributed Adam implementation must be avoiding some other overhead or getting really good behavior out of the memory pool. It's hard to say, since BERT is already pretty small.
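
Spelling that estimate out (4 bytes per FP32 optimizer-state element, 2 bytes per 16-bit main parameter, decimal MB):

$$
55.4 \times 10^{6} \times \left(2 \times 4\,\mathrm{B} + 2\,\mathrm{B}\right) \times \left(1 - \tfrac{1}{4}\right) = 55.4 \times 10^{6} \times 10\,\mathrm{B} \times 0.75 \approx 416\ \mathrm{MB}
$$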

okuchaiev (Member) commented:

@timmoon10, can you please resolve the merge conflicts?

timmoon10 marked this pull request as ready for review on November 9, 2022, 21:45.
timmoon10 (Collaborator, Author) commented:

Running on 4 A100s for 20 steps (in BF16 with O2 optimizations):

| ZeRO | DP | PP | TP | Throughput | Train loss | Val loss | Memory usage |
|---|---|---|---|---|---|---|---|
| No | 1 | 1 | 1 | 4.50it/s | 10.2 | 9.850 | 4002 MB |
| Yes | 1 | 1 | 1 | 4.61it/s | 10.2 | 9.850 | 4052 MB |
| No | 2 | 1 | 1 | 3.69it/s | 10.2 | 9.800 | 4002 MB |
| Yes | 2 | 1 | 1 | 3.81it/s | 10.2 | 9.790 | 3432 MB |
| No | 4 | 1 | 1 | 3.09it/s | 10.1 | 9.700 | 4002 MB |
| Yes | 4 | 1 | 1 | 3.16it/s | 10.1 | 9.690 | 3134 MB |
| No | 2 | 2 | 1 | 1.99it/s | 10.1 | 9.660 | 2580 MB |
| Yes | 2 | 2 | 1 | 1.94it/s | 10.1 | 9.660 | 2540 MB |
| No | 2 | 1 | 2 | 2.26it/s | 10.1 | 9.740 | 2358 MB |
| Yes | 2 | 1 | 2 | 2.22it/s | 10.1 | 9.730 | 2174 MB |

I think this PR is good to go.

ericharper (Collaborator) left a comment:


LGTM. Thanks!

shanmugamr1992 merged commit 785057e into NVIDIA:main on Nov 14, 2022.
hainan-xv pushed commits to hainan-xv/NeMo that referenced this pull request on Nov 29, 2022.
JimmyZhang12 pushed a commit to JimmyZhang12/NeMo that referenced this pull request on Dec 14, 2022.
andrusenkoau pushed a commit to andrusenkoau/NeMo that referenced this pull request on Jan 5, 2023.