
Update Nemo Distributed Checkpoint User Guide #11489

Merged: 8 commits into NVIDIA:main on Dec 6, 2024

Conversation

@FortunaZhang (Contributor) commented on Dec 5, 2024

What does this PR do?

This PR updates the NeMo distributed checkpoint user guide with best practices for parameter tuning, an explanation of the mechanism, FAQs, and more.

Collection: Distributed checkpoint.

Changelog

  • Update dist_ckpt.rst documentation in the NeMo GitHub repository
  • Update Introduction
  • Add Mechanism with images
  • Add Parameter Tuning Best Practices
  • Add FAQs
  • Add Glossary

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
  • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.

@jgerh (Collaborator) left a comment:

Completed the technical edit of docs/source/checkpoints/dist_ckpt.rst and provided some copyedits and suggested revisions.


Megatron Core is an open-source, PyTorch-based library that provides a collection of GPU optimization techniques including various parallelisms (data, tensor, pipeline, context, and expert parallelism). NeMo Framework is an end to end LLM training framework that builds on top of the Megatron Core library.
@jgerh (Collaborator): fix punctuation

Megatron Core is an open-source, PyTorch-based library that provides a collection of GPU optimization techniques including various parallelisms (data, tensor, pipeline, context, and expert parallelism). NeMo Framework is an end-to-end LLM training framework that builds on top of the Megatron Core library.


In large-scale training, checkpoints are used to periodically save intermediate model states (including model weights, optimizer states, and other necessary metadata). This allows for easy recovery if the training process is interrupted.

NeMo Distributed Checkpoint, part of the Megatron Core library, refers to saving the state of a distributed training job across multiple GPUs or nodes. This approach aims to reduce memory overhead and improve GPU utilization. It also allows users the flexibility to resume training using different parallelisms.
@jgerh (Collaborator): revise last sentence

NeMo Distributed Checkpoint, part of the Megatron Core library, refers to saving the state of a distributed training job across multiple GPUs or nodes. This approach aims to reduce memory overhead and improve GPU utilization. It also provides users with the flexibility to resume training using different parallelism strategies.

Megatron Core provides a checkpointing library capable of handling all types of parallelisms used in LLM training.
Although the distributed checkpointing library is targeted at the Megatron Core model, it can also be used with other models, as long as proper integration is implemented.

The library provides two main entrypoints, ``dist_checkpointing.save`` and ``dist_checkpointing.load``, which are meant to replace ``torch.save`` and ``torch.load`` in the regular checkpointing flow.
Apart from that, it provides a mechanism to define how different types of local tensors should be combined and split in the global checkpoint.
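
As a minimal usage sketch (assuming Megatron Core's ``dist_checkpointing`` module and its ``ShardedTensor`` helper; the tensor key, path, and sharding layout below are illustrative only), the two entrypoints can be used roughly as follows::

    # Sketch only: assumes the megatron.core.dist_checkpointing API
    # (save/load entrypoints and ShardedTensor.from_rank_offsets);
    # the key name, path, and sharding layout are illustrative.
    import torch
    import torch.distributed as dist
    from megatron.core import dist_checkpointing
    from megatron.core.dist_checkpointing import ShardedTensor

    ckpt_dir = '/checkpoints/iter_0001000'   # hypothetical checkpoint directory

    # Each rank describes how its local tensor maps into the global checkpoint:
    # here the local shard is placed along dim 0 according to the rank index.
    local_weight = torch.empty(1024, 1024, device='cuda')
    sharded_state_dict = {
        'model.linear.weight': ShardedTensor.from_rank_offsets(
            'model.linear.weight', local_weight,
            (0, dist.get_rank(), dist.get_world_size()),
        ),
    }

    # Replaces torch.save in the regular checkpointing flow.
    dist_checkpointing.save(sharded_state_dict, ckpt_dir)

    # Replaces torch.load: each rank loads only the shards it owns.
    loaded_state_dict = dist_checkpointing.load(sharded_state_dict, ckpt_dir)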


Mechanism
--------------
The NeMo Distributed Checkpoint enables saving and loading models from multiple ranks in parallel. It employs a novel strategy **Fully Parallel Saving** (FPS) to partition the optimizer states, gradients and model parameters across all GPU ranks. When saving the checkpoint of a distributed optimizer, each DP rank holds its shard of the optimizer state and independently writes its shard to the shared storage (grad buffer).
@jgerh (Collaborator): fix punctuation, remove bold, revise

The NeMo Distributed Checkpoint enables saving and loading models from multiple ranks in parallel. It employs a novel strategy called Fully Parallel Saving (FPS) to partition the optimizer states, gradients, and model parameters across all GPU ranks. When saving the checkpoint of a distributed optimizer, each DP rank holds its shard of the optimizer state and independently writes its shard to the shared storage (grad buffer).


When loading the checkpoint, each DP rank reads its corresponding checkpoint file (shard) to recover. If different parallelism strategies are needed (e.g., tensor parallelism, pipeline parallelism), each rank can also access other checkpoint files to transfer data to the correct locations.

NeMo enables users to resume training from a checkpoint saved with different tensor and pipeline parallelism degrees, providing the flexibility to change training configurations as needed during training.
@jgerh (Collaborator): suggested revision

NeMo allows users to resume training from a checkpoint saved with different tensor and pipeline parallelism degrees, offering the flexibility to adjust training configurations as needed.

Parameter Tuning
--------------

Distributed checkpoints in NeMo could be configured for pre-training and fine-tuning jobs.
@jgerh (Collaborator): revise to active voice, second person

You can configure distributed checkpoints in NeMo pre-training and fine-tuning jobs.

dist_ckpt_load_strictness: null
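
Only the ``dist_ckpt_load_strictness`` line above comes from the guide; as a rough sketch of how such options might be assembled into a NeMo-style model config (the other key names and values here are assumptions, not a definitive list), one could write::

    # Sketch only: dist_ckpt_load_strictness is the option shown above;
    # the other keys/values are assumed examples of NeMo-style config options.
    from omegaconf import OmegaConf

    model_cfg = OmegaConf.create({
        "dist_ckpt_format": "torch_dist",     # assumed: distributed checkpoint format/backend
        "dist_ckpt_parallel_save": True,      # assumed: enable fully parallel saving
        "dist_ckpt_load_strictness": None,    # from the guide; null keeps the default strictness handling
    })
    print(OmegaConf.to_yaml(model_cfg))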


Here are more details of the checkpoint format options and related parameters:
@jgerh (Collaborator): revise

Here's a summary of the checkpoint format options and related parameters:

Distributed Optimizer
^^^^^^^^^^^^^^^^^^^^^
The distributed optimizer is a memory-optimized data-parallel deployment method. It shards the optimizer states and the high-precision master parameters across data-parallel GPUs instead of replicating them. At the optimizer step, each data-parallel GPU updates only its shard of the parameters. Since each GPU needs only its own gradient shard, the distributed optimizer performs a reduce-scatter of the parameter gradients instead of an all-reduce. The updated parameter shards are then all-gathered across data-parallel GPUs. This approach significantly reduces the memory needed for large-scale LLM training. In addition, when the gradient precision is higher than the parameter precision, splitting the communication into a gradient reduce-scatter and a parameter all-gather reduces the total communication volume. This split collective execution also increases the amount of computation that can be overlapped with the communication, improving the overlap opportunity.

For more information, refer to https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html.
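
The gradient reduce-scatter plus parameter all-gather pattern described above can be sketched with plain ``torch.distributed`` collectives. This is a simplified illustration of the communication pattern only, not Megatron Core's actual distributed optimizer implementation, and the function and argument names are made up for the example::

    # Simplified illustration of the split collective pattern (not Megatron Core's
    # actual distributed optimizer). Each data-parallel rank owns one shard of the
    # parameters and optimizer state; buffers are flat 1-D tensors whose length is
    # divisible by the world size.
    import torch
    import torch.distributed as dist

    def distributed_optimizer_step(flat_params, flat_grads, optimizer_step_fn):
        # optimizer_step_fn is a placeholder for the user's parameter update rule.
        world_size = dist.get_world_size()
        shard_size = flat_params.numel() // world_size

        # Reduce-scatter the gradients: each rank receives the reduced gradient
        # for its own parameter shard (instead of an all-reduce of the full buffer).
        grad_shard = torch.empty(shard_size, device=flat_grads.device, dtype=flat_grads.dtype)
        dist.reduce_scatter_tensor(grad_shard, flat_grads, op=dist.ReduceOp.AVG)

        # Each rank updates only its shard of the (master) parameters.
        param_shard = flat_params.narrow(0, dist.get_rank() * shard_size, shard_size)
        optimizer_step_fn(param_shard, grad_shard)

        # All-gather the updated shards so every rank ends up with the full
        # parameter buffer again (clone avoids aliasing the output buffer).
        dist.all_gather_into_tensor(flat_params, param_shard.clone())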
@FortunaZhang (Contributor, Author) commented: Updated per the above suggestions and changed the image source; please check again.

@jgerh (Collaborator) left a comment: Approve revisions

@pablo-garay merged commit 74c5372 into NVIDIA:main on Dec 6, 2024
11 checks passed
youngeunkwon0405 pushed a commit to youngeunkwon0405/NeMo that referenced this pull request Feb 10, 2025
* Update dist_ckpt.rst with best practices

Signed-off-by: FortunaZhang <[email protected]>

* Update dist_ckpt.rst with best practices

Signed-off-by: FortunaZhang <[email protected]>

* Add files via upload

Signed-off-by: FortunaZhang <[email protected]>

* Update dist_ckpt.rst per reviewer suggestions

Signed-off-by: FortunaZhang <[email protected]>

* Update dist ckpt image source to Release assets

Signed-off-by: FortunaZhang <[email protected]>

* Update dist_ckpt.rst

Signed-off-by: FortunaZhang <[email protected]>

---------

Signed-off-by: FortunaZhang <[email protected]>
Signed-off-by: Youngeun Kwon <[email protected]>