Update Nemo Distributed Checkpoint User Guide #11489
Conversation
Signed-off-by: FortunaZhang <[email protected]>
Completed the technical edit of docs/source/checkpoints/dist_ckpt.rst and provided some copyedits and suggested revisions.
Megatron Core is an open-source, PyTorch-based library that provides a collection of GPU optimization techniques including various parallelisms (data, tensor, pipeline, context, and expert parallelism). NeMo Framework is an end to end LLM training framework that builds on top of the Megatron Core library.
fix punctuation
Megatron Core is an open-source, PyTorch-based library that provides a collection of GPU optimization techniques including various parallelisms (data, tensor, pipeline, context, and expert parallelism). NeMo Framework is an end-to-end LLM training framework that builds on top of the Megatron Core library.
In large-scale training, checkpoints are used to periodically save intermediate model states (including model weights, optimizer states, and other necessary metadata). This allows for easy recovery if the training process is interrupted.
NeMo Distributed Checkpoint, part of the Megatron Core library, refers to saving the state of a distributed training job across multiple GPUs or nodes. This approach aims to reduce memory overhead and improve GPU utilization. It also allows users the flexibility to resume training using different parallelisms.
revise last sentence
NeMo Distributed Checkpoint, part of the Megatron Core library, refers to saving the state of a distributed training job across multiple GPUs or nodes. This approach aims to reduce memory overhead and improve GPU utilization. It also provides users with the flexibility to resume training using different parallelism strategies.
Megatron Core provides a checkpointing library capable of handling all types of parallelisms used in LLM training.
Although the distributed checkpointing library is targeted at the Megatron Core model, it can also be used with other models, as long as proper integration is implemented.
The library provides two main entrypoints, ``dist_checkpointing.save`` and ``dist_checkpointing.load``, which are meant to replace ``torch.save`` and ``torch.load`` in the regular checkpointing flow.
Apart from that, it provides a mechanism to define how different types of local tensors should be combined and split in the global checkpoint.
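For illustration, a minimal sketch of how these entrypoints are typically driven. It assumes a Megatron Core-style ``ShardedTensor.from_rank_offsets`` factory and an already initialized ``torch.distributed`` process group; the key name and sharding layout below are made up for the example and are not part of the guide.

```python
import torch
from megatron.core import dist_checkpointing
from megatron.core.dist_checkpointing import ShardedTensor


def save_sharded_weight(local_weight: torch.Tensor, tp_rank: int, tp_size: int, ckpt_dir: str):
    """Sketch: save one tensor-parallel shard into a distributed checkpoint."""
    # Wrap the local shard with metadata describing where it sits in the
    # global tensor (here: sharded along dim 0 across tp_size ranks).
    sharded_state_dict = {
        "decoder.weight": ShardedTensor.from_rank_offsets(
            "decoder.weight", local_weight, (0, tp_rank, tp_size)
        ),
    }
    # Drop-in replacement for torch.save in the regular checkpointing flow;
    # every rank calls this collectively and writes only its own shard.
    dist_checkpointing.save(sharded_state_dict, ckpt_dir)
```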
Mechanism
--------------
The NeMo Distributed Checkpoint enables saving and loading models from multiple ranks in parallel. It employs a novel strategy **Fully Parallel Saving** (FPS) to partition the optimizer states, gradients and model parameters across all GPU ranks. When saving the checkpoint of a distributed optimizer, each DP rank holds its shard of the optimizer state and independently writes its shard to the shared storage (grad buffer).
fix punctuation, remove bold, revise
The NeMo Distributed Checkpoint enables saving and loading models from multiple ranks in parallel. It employs a novel strategy called Fully Parallel Saving (FPS) to partition the optimizer states, gradients, and model parameters across all GPU ranks. When saving the checkpoint of a distributed optimizer, each DP rank holds its shard of the optimizer state and independently writes its shard to the shared storage (grad buffer).
When loading the checkpoint, each DP rank reads its corresponding checkpoint file (shard) to recover. If different parallelism strategies are needed (e.g., tensor parallelism, pipeline parallelism), each rank can also access other checkpoint files to transfer data to the correct locations.
NeMo enables users to resume training from a checkpoint saved with different tensor and pipeline parallelism degrees, providing the flexibility to change training configurations as needed during training.
suggested revision
NeMo allows users to resume training from a checkpoint saved with different tensor and pipeline parallelism degrees, offering the flexibility to adjust training configurations as needed.
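To make the re-sharding idea concrete, a hedged sketch of the load side when the tensor-parallel degree changes. The call pattern follows the ``dist_checkpointing.load`` entrypoint quoted earlier; the key name and the ``new_tp_*`` variables are illustrative only.

```python
import torch
from megatron.core import dist_checkpointing
from megatron.core.dist_checkpointing import ShardedTensor


def load_with_new_tp(local_buffer: torch.Tensor, new_tp_rank: int, new_tp_size: int, ckpt_dir: str):
    """Sketch: resume from a checkpoint using a different tensor-parallel degree."""
    # Describe the slice of the global "decoder.weight" tensor this rank owns
    # under the *new* parallel layout (sharded along dim 0 here). The checkpoint
    # may have been written with a different layout.
    sharded_state_dict = {
        "decoder.weight": ShardedTensor.from_rank_offsets(
            "decoder.weight", local_buffer, (0, new_tp_rank, new_tp_size)
        ),
    }
    # The library resolves the requested slices from the saved shards, so each
    # rank reads only the data it needs for the new configuration.
    return dist_checkpointing.load(sharded_state_dict, ckpt_dir)
```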
Parameter Tuning
----------------
Distributed checkpoints in NeMo could be configured for pre-training and fine-tuning jobs.
revise to active voice, second person
You can configure distributed checkpoints in NeMo pre-training and fine-tuning jobs.
dist_ckpt_load_strictness: null
Here are more details of the checkpoint format options and related parameters:
revise
Here's a summary of the checkpoint format options and related parameters:
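As a concrete (hedged) example, such options usually surface as ``model``-level keys in the training config. Only ``dist_ckpt_load_strictness`` is taken from the snippet above; the other key names are common NeMo options whose exact spelling and defaults may differ between versions, so verify them against your NeMo release.

```python
from omegaconf import OmegaConf

# Hypothetical override of distributed-checkpoint settings in a NeMo-style
# OmegaConf/Hydra config. Parameter names other than dist_ckpt_load_strictness
# are assumptions and should be checked against your NeMo version.
dist_ckpt_overrides = OmegaConf.create(
    {
        "model": {
            "dist_ckpt_format": "torch_dist",   # checkpoint backend/format (e.g. "torch_dist" or "zarr")
            "dist_ckpt_parallel_save": True,    # fully parallel saving across data-parallel ranks
            "dist_ckpt_load_strictness": None,  # corresponds to `dist_ckpt_load_strictness: null` above
        }
    }
)
print(OmegaConf.to_yaml(dist_ckpt_overrides))
```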
Distributed Optimizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The distributed optimizer is a memory-optimized data-parallel deployment method. It shards the optimizer states and the high-precision master parameters across data-parallel GPUs instead of replicating them. At the parameter optimizer step, each data-parallel GPU updates its shard of parameters. Since each GPU needs its own gradient shard, the distributed optimizer conducts reduce-scatter of the parameter gradients instead of all-reduce of them. Then, the updated parameter shards are all-gathered across data-parallel GPUs. This approach significantly reduces the memory need of large-scale LLM training. Also, when the precision of the gradient is higher than the parameter precision, the split execution of gradient reduce-scatter and parameter all-gather can reduce the total communication volume. This split collective execution increases the total computation to overlap with the communication, which improves the overlap opportunity.
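The split collective pattern described here can be illustrated with plain ``torch.distributed`` primitives. This is a toy sketch of the idea, not NeMo or Megatron Core code; it assumes flat, evenly divisible parameter and gradient buffers and an already initialized process group.

```python
import torch
import torch.distributed as dist


def distributed_optimizer_step(params: torch.Tensor, grads: torch.Tensor, lr: float = 1e-3) -> None:
    """Toy SGD step using reduce-scatter of gradients and all-gather of parameters."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    shard_numel = params.numel() // world_size  # assume numel divides evenly

    # Reduce-scatter: each data-parallel rank receives only the reduced
    # gradient shard it owns, instead of all-reducing the full gradient.
    grad_shard = torch.empty(shard_numel, dtype=grads.dtype, device=grads.device)
    dist.reduce_scatter_tensor(grad_shard, grads, op=dist.ReduceOp.SUM)
    grad_shard /= world_size

    # Each rank updates only its own parameter shard.
    param_shard = params[rank * shard_numel : (rank + 1) * shard_numel].clone()
    param_shard -= lr * grad_shard

    # All-gather the updated shards so every rank ends up with the full parameters.
    dist.all_gather_into_tensor(params, param_shard)
```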
More information please refer to https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html.
revise
For more information, please refer to https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html.
Signed-off-by: FortunaZhang <[email protected]>
Updated per the above suggestions and changed the image source. Please check again.
Signed-off-by: FortunaZhang <[email protected]>
Approve revisions
* Update dist_ckpt.rst with best practices
* Update dist_ckpt.rst with best practices
* Add files via upload
* Update dist_ckpt.rst per reviewer suggestions
* Update dist ckpt image source to Release assets
* Update dist_ckpt.rst

Signed-off-by: FortunaZhang <[email protected]>
Signed-off-by: Youngeun Kwon <[email protected]>
What does this PR do?
This PR updates the NeMo distributed checkpoint user guide documentation with best practices for parameter tuning, mechanism, FAQs, etc.
Collection: Distributed checkpoint.
Changelog
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contain specific people who can review PRs to various areas.