Yuya/add checkpoints section (NVIDIA#9329)
* Add checkpoints section

Signed-off-by: yaoyu-33 <[email protected]>

* Fix title

Signed-off-by: yaoyu-33 <[email protected]>

* update

Signed-off-by: yaoyu-33 <[email protected]>

* Add section on ".qnemo" checkpoints (NVIDIA#9503)

* Add 'Quantized Checkpoints' section

Signed-off-by: Jan Lasek <[email protected]>

* Address review comments

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>

* Distributed checkpointing user guide (NVIDIA#9494)

* Describe shardings and entrypoints

Signed-off-by: Mikołaj Błaż <[email protected]>

* Strategies, optimizers, finalize entrypoints

Signed-off-by: Mikołaj Błaż <[email protected]>

* Transformations

Signed-off-by: Mikołaj Błaż <[email protected]>

* Integration

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add link from intro

Signed-off-by: Mikołaj Błaż <[email protected]>

* Apply grammar suggestions

Signed-off-by: Mikołaj Błaż <[email protected]>

* Explain the example

Signed-off-by: Mikołaj Błaż <[email protected]>

* Apply review suggestions

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add zarr and torch_dist explanation

---------

Signed-off-by: Mikołaj Błaż <[email protected]>

* add subsection

Signed-off-by: yaoyu-33 <[email protected]>

* Update docs/source/checkpoints/intro.rst

Co-authored-by: Chen Cui <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* address comments

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix code block

Signed-off-by: yaoyu-33 <[email protected]>

* address comments

Signed-off-by: yaoyu-33 <[email protected]>

* formatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Yu Yao <[email protected]>
Co-authored-by: Jan Lasek <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Signed-off-by: Malay Nagda <[email protected]>
4 people authored and Malay Nagda committed Jul 26, 2024
1 parent 961dcf2 commit c06e0ae
Showing 7 changed files with 494 additions and 25 deletions.
File renamed without changes.
File renamed without changes.
427 changes: 427 additions & 0 deletions docs/source/checkpoints/dist_ckpt.rst

Large diffs are not rendered by default.

64 changes: 64 additions & 0 deletions docs/source/checkpoints/intro.rst
@@ -0,0 +1,64 @@
Checkpoints
===========


In this section, we present key NVIDIA NeMo functionality related to checkpoint management.

Understanding Checkpoint Formats
--------------------------------

A ``.nemo`` checkpoint is fundamentally a tar file that bundles the model configuration (specified in a YAML file), the model weights (in a ``.ckpt`` file), and other artifacts such as tokenizer models or vocabulary files. This consolidated design streamlines sharing, loading, tuning, evaluation, and inference.

In contrast, the ``.ckpt`` file, created during PyTorch Lightning training, contains both the model weights and the optimizer states, and is usually used to resume training.
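
For example, you can inspect or unpack a ``.nemo`` archive with standard ``tar`` commands. The checkpoint name below is a placeholder; the artifacts bundled inside vary by model:

.. code-block:: bash

    # List the contents of a .nemo checkpoint without extracting it
    tar -tvf megatron_gpt.nemo

    # Unpack the bundled config, weights, and tokenizer artifacts
    mkdir -p megatron_gpt_extracted
    tar -xvf megatron_gpt.nemo -C megatron_gpt_extracted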

Sharded Model Weights
---------------------

Within ``.nemo`` or ``.ckpt`` checkpoints, the model weights can be saved either in a regular format (a single file called ``model_weights.ckpt`` inside the model-parallelism folders) or in a sharded format (a folder called ``model_weights``).

With sharded model weights, you can save and load the training state across multiple GPUs or nodes more efficiently, and you avoid having to repartition the model when you resume tuning with a different model-parallelism setup.

NeMo supports the distributed (sharded) checkpoint format from Megatron Core. Megatron Core provides two checkpoint backends: a PyTorch-based backend (recommended) and a Zarr-based backend (deprecated).
For a detailed explanation, see the :doc:`dist_ckpt` guide.
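
For illustration, a checkpoint saved with the PyTorch-based backend is a directory rather than a single file. The listing below is a sketch only; the exact file names and counts depend on the backend, the model, and the Megatron Core version:

.. code-block:: bash

    # Illustrative layout of a sharded model_weights directory (PyTorch-based backend)
    model_weights
    ├── __0_0.distcp
    ├── __0_1.distcp
    ├── common.pt
    └── metadata.json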


Quantized Checkpoints
---------------------

NeMo provides a :doc:`Post-Training Quantization <../nlp/quantization>` workflow that allows you to convert regular ``.nemo`` models into a `TensorRT-LLM checkpoint <https://nvidia.github.io/TensorRT-LLM/architecture/checkpoint.html>`_, commonly referred to as ``.qnemo`` checkpoints in NeMo. These ``.qnemo`` checkpoints can then be used with the `NVIDIA TensorRT-LLM library <https://nvidia.github.io/TensorRT-LLM/index.html>`_ for efficient inference.

A ``.qnemo`` checkpoint, similar to ``.nemo`` checkpoints, is a tar file that bundles the model configuration specified in the ``config.json`` file along with the ``rank{i}.safetensors`` files. These ``.safetensors`` files store the model weights for each rank individually. In addition, a ``tokenizer_config.yaml`` file is saved, containing only the tokenizer section from the original NeMo ``model_config.yaml`` file. This configuration file defines the tokenizer used by the given model.

When working with large quantized LLMs, it is recommended that you leave the checkpoint uncompressed as a directory rather than a tar file. You can control this behavior by setting the ``compress`` flag when exporting quantized models in the `PTQ configuration file <https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_ptq.yaml>`_.
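
As a sketch, assuming a Hydra-style launch of the PTQ example script, the flag can be overridden from the command line. The script path and the exact key for ``compress`` are assumptions here; check the linked configuration file for the authoritative names:

.. code-block:: bash

    # Keep the exported quantized checkpoint as an uncompressed directory.
    # The script path and the "export.compress" key are illustrative only.
    python examples/nlp/language_modeling/megatron_gpt_ptq.py \
        export.compress=false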

The following example shows the contents of a quantized model intended to be served using two GPUs (ranks):

.. code-block:: bash

    model-qnemo
    ├── config.json
    ├── rank0.safetensors
    ├── rank1.safetensors
    ├── tokenizer.model
    └── tokenizer_config.yaml

Community Checkpoint Converter
------------------------------

We provide easy-to-use tools that enable users to convert community checkpoints into the NeMo format. These tools facilitate various operations, including resuming training, Supervised Fine-Tuning (SFT), Parameter-Efficient Fine-Tuning (PEFT), and deployment. For detailed instructions and guidelines, please refer to our documentation.

We offer comprehensive guides to assist both end users and developers:

- **User Guide**: Detailed steps on how to convert community model checkpoints for further training or deployment within NeMo. For more information, please see our :doc:`user_guide`.

- **Developer Guide**: Instructions for developers on how to implement converters for community model checkpoints, allowing for broader compatibility and integration within the NeMo ecosystem. For development details, refer to our :doc:`dev_guide`.

- **Megatron-LM Checkpoint Conversion**: NVIDIA NeMo and NVIDIA Megatron-LM share several foundational technologies. You can convert GPT-style model checkpoints trained with Megatron-LM into the NeMo Framework using our conversion scripts; a sketch of such an invocation follows this list, and the full instructions are in :doc:`convert_mlm`.
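
The sketch below shows what a Megatron-LM conversion invocation can look like. The script path and arguments are assumptions for illustration; the authoritative command and options are documented in :doc:`convert_mlm`:

.. code-block:: bash

    # Illustrative only: the script path and arguments may differ in your NeMo version.
    python examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py \
        --checkpoint_folder /path/to/megatron_lm/checkpoints \
        --checkpoint_name model_optim_rng.pt \
        --nemo_file_path gpt_converted.nemo \
        --model_type gpt \
        --tensor_model_parallel_size 1 \
        --pipeline_model_parallel_size 1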

.. toctree::
   :maxdepth: 1
   :caption: NeMo Checkpoints

   dist_ckpt
   user_guide
   dev_guide
   convert_mlm
File renamed without changes.
22 changes: 0 additions & 22 deletions docs/source/ckpt_converters/intro.rst

This file was deleted.

6 changes: 3 additions & 3 deletions docs/source/index.rst
@@ -58,10 +58,10 @@ For more information, browse the developer docs for your area of interest in the

 .. toctree::
    :maxdepth: 1
-   :caption: Community Model Converters
-   :name: CheckpointConverters
+   :caption: Model Checkpoints
+   :name: Checkpoints

-   ckpt_converters/intro
+   checkpoints/intro

 .. toctree::
    :maxdepth: 1
