
Distributed optimizer support for BERT #5305

Merged 2 commits into NVIDIA:main on Nov 14, 2022

Conversation

timmoon10 (Collaborator) commented on Nov 2, 2022

What does this PR do?

Add support for the distributed Adam optimizer to BERT.

Collection: NLP

Changelog

  • Add support for the Apex distributed Adam optimizer to BERT

Usage

Set the optimizer to distributed_fused_adam in the config file:
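
For example (a minimal sketch, assuming the standard NeMo Hydra-style config layout where the optimizer settings live under `model.optim`; all fields other than `name` are illustrative placeholders):

```yaml
model:
  optim:
    # Select the Apex distributed Adam (ZeRO-style) optimizer.
    name: distributed_fused_adam
    # Illustrative hyperparameters; keep whatever values your existing config uses.
    lr: 2e-4
    weight_decay: 0.01
    betas: [0.9, 0.98]
```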

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Pinging @shanmugamr1992.

Additional Information

timmoon10 (Collaborator, Author) commented on Nov 2, 2022

Running on 4 A40s for 20 steps (in BF16 with O2 optimizations):

| ZeRO | Data parallel size | Tensor parallel size | Pipeline parallel size | Throughput | Train loss | Val loss | Allocated memory | Reserved memory |
|---|---|---|---|---|---|---|---|---|
| No | 4 | 1 | 1 | 3.69it/s | 10.2 | 9.680 | 3855 MB | 4002 MB |
| Yes | 4 | 1 | 1 | 3.79it/s | 10.2 | 9.680 | 2997 MB | 3134 MB |
| No | 2 | 2 | 1 | 2.99it/s | 10.2 | 9.720 | 2054 MB | 2374 MB |
| Yes | 2 | 2 | 1 | 3.00it/s | 10.2 | 9.710 | 1864 MB | 2190 MB |
| No | 2 | 1 | 2 | 2.42it/s | 10.2 | 9.700 | 2561 MB | 2580 MB |
| Yes | 2 | 1 | 2 | 2.43it/s | 10.2 | 9.700 | 2224 MB | 2540 MB |

The memory savings are actually larger than I'd expect: the model has 55.4M parameters, so I'd expect ZeRO with 4-way data parallelism to save about 55.4M × (2 FP32 optimizer states + 1 16-bit main param) × (1 - 1/data_parallel_size) ≈ 416 MB. The distributed Adam implementation must be avoiding some other overhead or getting really good behavior out of the memory pool. It's hard to say, since BERT is already pretty small.
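
Spelling that estimate out (4 bytes per FP32 optimizer-state element, 2 bytes per 16-bit main parameter, decimal MB):

$$
55.4 \times 10^{6} \times \left(2 \times 4\,\mathrm{B} + 2\,\mathrm{B}\right) \times \left(1 - \tfrac{1}{4}\right) = 55.4 \times 10^{6} \times 10\,\mathrm{B} \times 0.75 \approx 416\ \mathrm{MB}
$$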

okuchaiev (Member) commented:

@timmoon10, can you please resolve the merge conflicts?

timmoon10 marked this pull request as ready for review on November 9, 2022, 21:45.
timmoon10 (Collaborator, Author) commented:

Running on 4 A100s for 20 steps (in BF16 with O2 optimizations):

| ZeRO | DP | PP | TP | Throughput | Train loss | Val loss | Memory usage |
|---|---|---|---|---|---|---|---|
| No | 1 | 1 | 1 | 4.50it/s | 10.2 | 9.850 | 4002 MB |
| Yes | 1 | 1 | 1 | 4.61it/s | 10.2 | 9.850 | 4052 MB |
| No | 2 | 1 | 1 | 3.69it/s | 10.2 | 9.800 | 4002 MB |
| Yes | 2 | 1 | 1 | 3.81it/s | 10.2 | 9.790 | 3432 MB |
| No | 4 | 1 | 1 | 3.09it/s | 10.1 | 9.700 | 4002 MB |
| Yes | 4 | 1 | 1 | 3.16it/s | 10.1 | 9.690 | 3134 MB |
| No | 2 | 2 | 1 | 1.99it/s | 10.1 | 9.660 | 2580 MB |
| Yes | 2 | 2 | 1 | 1.94it/s | 10.1 | 9.660 | 2540 MB |
| No | 2 | 1 | 2 | 2.26it/s | 10.1 | 9.740 | 2358 MB |
| Yes | 2 | 1 | 2 | 2.22it/s | 10.1 | 9.730 | 2174 MB |

I think this PR is good to go.

ericharper (Collaborator) left a comment:


LGTM. Thanks!

shanmugamr1992 merged commit 785057e into NVIDIA:main on Nov 14, 2022.
hainan-xv pushed commits to hainan-xv/NeMo that referenced this pull request on Nov 29, 2022.
JimmyZhang12 pushed a commit to JimmyZhang12/NeMo that referenced this pull request on Dec 14, 2022.
andrusenkoau pushed a commit to andrusenkoau/NeMo that referenced this pull request on Jan 5, 2023.