
Update Nemo Distributed Checkpoint User Guide #11489

Merged: 8 commits into NVIDIA:main on Dec 6, 2024

Conversation

@FortunaZhang (Contributor) commented on Dec 5, 2024

What does this PR do?

This PR updates the NeMo distributed checkpoint user guide with best practices for parameter tuning, an explanation of the mechanism, FAQs, and more.

Collection: Distributed checkpoint.

Changelog

  • Update dist_ckpt.rst documentation in the NeMo GitHub repository
  • Update Introduction
  • Add Mechanism with images
  • Add Parameter Tuning Best Practices
  • Add FAQs
  • Add Glossary

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
  • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.

@jgerh (Collaborator) left a comment:

Completed the technical edit of docs/source/checkpoints/dist_ckpt.rst and provided some copyedits and suggested revisions.


Megatron Core is an open-source, PyTorch-based library that provides a collection of GPU optimization techniques including various parallelisms (data, tensor, pipeline, context, and expert parallelism). NeMo Framework is an end to end LLM training framework that builds on top of the Megatron Core library.
@jgerh (Collaborator): fix punctuation

Megatron Core is an open-source, PyTorch-based library that provides a collection of GPU optimization techniques including various parallelisms (data, tensor, pipeline, context, and expert parallelism). NeMo Framework is an end-to-end LLM training framework that builds on top of the Megatron Core library.


In large-scale training, checkpoints are used to periodically save intermediate model states (including model weights, optimizer states, and other necessary metadata). This allows for easy recovery if the training process is interrupted.

NeMo Distributed Checkpoint, part of the Megatron Core library, refers to saving the state of a distributed training job across multiple GPUs or nodes. This approach aims to reduce memory overhead and improve GPU utilization. It also allows users the flexibility to resume training using different parallelisms.
@jgerh (Collaborator): revise last sentence

NeMo Distributed Checkpoint, part of the Megatron Core library, refers to saving the state of a distributed training job across multiple GPUs or nodes. This approach aims to reduce memory overhead and improve GPU utilization. It also provides users with the flexibility to resume training using different parallelism strategies.

Megatron Core provides a checkpointing library capable of handling all types of parallelisms used in LLM training.
Although the distributed checkpointing library is targeted at the Megatron Core model, it can also be used with other models, as long as proper integration is implemented.

The library provides two main entrypoints, ``dist_checkpointing.save`` and ``dist_checkpointing.load``, which are meant to replace ``torch.save`` and ``torch.load`` in the regular checkpointing flow.
Apart from that, it provides a mechanism to define how different types of local tensors should be combined and split in the global checkpoint.
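
As a minimal usage sketch (assuming Megatron Core's ``dist_checkpointing`` module and its ``ShardedTensor`` helper; the tensor key, path, and sharding layout below are illustrative only), the two entrypoints can be used roughly as follows::

    # Sketch only: assumes the megatron.core.dist_checkpointing API
    # (save/load entrypoints and ShardedTensor.from_rank_offsets);
    # the key name, path, and sharding layout are illustrative.
    import torch
    import torch.distributed as dist
    from megatron.core import dist_checkpointing
    from megatron.core.dist_checkpointing import ShardedTensor

    ckpt_dir = '/checkpoints/iter_0001000'   # hypothetical checkpoint directory

    # Each rank describes how its local tensor maps into the global checkpoint:
    # here the local shard is placed along dim 0 according to the rank index.
    local_weight = torch.empty(1024, 1024, device='cuda')
    sharded_state_dict = {
        'model.linear.weight': ShardedTensor.from_rank_offsets(
            'model.linear.weight', local_weight,
            (0, dist.get_rank(), dist.get_world_size()),
        ),
    }

    # Replaces torch.save in the regular checkpointing flow.
    dist_checkpointing.save(sharded_state_dict, ckpt_dir)

    # Replaces torch.load: each rank loads only the shards it owns.
    loaded_state_dict = dist_checkpointing.load(sharded_state_dict, ckpt_dir)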


Mechanism
--------------
The NeMo Distributed Checkpoint enables saving and loading models from multiple ranks in parallel. It employs a novel strategy **Fully Parallel Saving** (FPS) to partition the optimizer states, gradients and model parameters across all GPU ranks. When saving the checkpoint of a distributed optimizer, each DP rank holds its shard of the optimizer state and independently writes its shard to the shared storage (grad buffer).
@jgerh (Collaborator): fix punctuation, remove bold, revise

The NeMo Distributed Checkpoint enables saving and loading models from multiple ranks in parallel. It employs a novel strategy called Fully Parallel Saving (FPS) to partition the optimizer states, gradients, and model parameters across all GPU ranks. When saving the checkpoint of a distributed optimizer, each DP rank holds its shard of the optimizer state and independently writes its shard to the shared storage (grad buffer).


When loading the checkpoint, each DP rank reads its corresponding checkpoint file (shard) to recover. If different parallelism strategies are needed (e.g., tensor parallelism, pipeline parallelism), each rank can also access other checkpoint files to transfer data to the correct locations.

NeMo enables users to resume training from a checkpoint saved with different tensor and pipeline parallelism degrees, providing the flexibility to change training configurations as needed during training.
@jgerh (Collaborator): suggested revision

NeMo allows users to resume training from a checkpoint saved with different tensor and pipeline parallelism degrees, offering the flexibility to adjust training configurations as needed.

Parameter Tuning
--------------

Distributed checkpoints in NeMo could be configured for pre-training and fine-tuning jobs.
@jgerh (Collaborator): revise to active voice, second person

You can configure distributed checkpoints in NeMo pre-training and fine-tuning jobs.

dist_ckpt_load_strictness: null
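
Only the ``dist_ckpt_load_strictness`` line above comes from the guide; as a rough sketch of how such options might be assembled into a NeMo-style model config (the other key names and values here are assumptions, not a definitive list), one could write::

    # Sketch only: dist_ckpt_load_strictness is the option shown above;
    # the other keys/values are assumed examples of NeMo-style config options.
    from omegaconf import OmegaConf

    model_cfg = OmegaConf.create({
        "dist_ckpt_format": "torch_dist",     # assumed: distributed checkpoint format/backend
        "dist_ckpt_parallel_save": True,      # assumed: enable fully parallel saving
        "dist_ckpt_load_strictness": None,    # from the guide; null keeps the default strictness handling
    })
    print(OmegaConf.to_yaml(model_cfg))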


Here are more details of the checkpoint format options and related parameters:
@jgerh (Collaborator): revise

Here's a summary of the checkpoint format options and related parameters:

Distributed Optimizer
^^^^^^^^^^^^^^^^^^^^^
The distributed optimizer is a memory-optimized data-parallel deployment method. It shards the optimizer states and the high-precision master parameters across data-parallel GPUs instead of replicating them. At the optimizer step, each data-parallel GPU updates only its shard of the parameters. Since each GPU needs only its own gradient shard, the distributed optimizer performs a reduce-scatter of the parameter gradients instead of an all-reduce. The updated parameter shards are then all-gathered across data-parallel GPUs. This approach significantly reduces the memory needed for large-scale LLM training. In addition, when the gradient precision is higher than the parameter precision, splitting the communication into a gradient reduce-scatter and a parameter all-gather reduces the total communication volume. This split collective execution also increases the amount of computation that can be overlapped with the communication, improving the overlap opportunity.

For more information, refer to https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html.
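
The gradient reduce-scatter plus parameter all-gather pattern described above can be sketched with plain ``torch.distributed`` collectives. This is a simplified illustration of the communication pattern only, not Megatron Core's actual distributed optimizer implementation, and the function and argument names are made up for the example::

    # Simplified illustration of the split collective pattern (not Megatron Core's
    # actual distributed optimizer). Each data-parallel rank owns one shard of the
    # parameters and optimizer state; buffers are flat 1-D tensors whose length is
    # divisible by the world size.
    import torch
    import torch.distributed as dist

    def distributed_optimizer_step(flat_params, flat_grads, optimizer_step_fn):
        # optimizer_step_fn is a placeholder for the user's parameter update rule.
        world_size = dist.get_world_size()
        shard_size = flat_params.numel() // world_size

        # Reduce-scatter the gradients: each rank receives the reduced gradient
        # for its own parameter shard (instead of an all-reduce of the full buffer).
        grad_shard = torch.empty(shard_size, device=flat_grads.device, dtype=flat_grads.dtype)
        dist.reduce_scatter_tensor(grad_shard, flat_grads, op=dist.ReduceOp.AVG)

        # Each rank updates only its shard of the (master) parameters.
        param_shard = flat_params.narrow(0, dist.get_rank() * shard_size, shard_size)
        optimizer_step_fn(param_shard, grad_shard)

        # All-gather the updated shards so every rank ends up with the full
        # parameter buffer again (clone avoids aliasing the output buffer).
        dist.all_gather_into_tensor(flat_params, param_shard.clone())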
@FortunaZhang (Contributor, Author) commented: Updated per the above suggestions and changed the image source; please check again.

@jgerh (Collaborator) left a comment: Approve revisions

@pablo-garay merged commit 74c5372 into NVIDIA:main on Dec 6, 2024
11 checks passed
youngeunkwon0405 pushed a commit to youngeunkwon0405/NeMo that referenced this pull request Feb 10, 2025
* Update dist_ckpt.rst with best practices

Signed-off-by: FortunaZhang <[email protected]>

* Update dist_ckpt.rst with best practices

Signed-off-by: FortunaZhang <[email protected]>

* Add files via upload

Signed-off-by: FortunaZhang <[email protected]>

* Update dist_ckpt.rst per reviewer suggestions

Signed-off-by: FortunaZhang <[email protected]>

* Update dist ckpt image source to Release assets

Signed-off-by: FortunaZhang <[email protected]>

* Update dist_ckpt.rst

Signed-off-by: FortunaZhang <[email protected]>

---------

Signed-off-by: FortunaZhang <[email protected]>
Signed-off-by: Youngeun Kwon <[email protected]>