Add section on ".qnemo" checkpoints #9503

Merged

Conversation

janekl
Collaborator

@janekl janekl commented Jun 19, 2024

What does this PR do ?

Add section on ".qnemo" checkpoints to #9329.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)


NeMo also offers a :doc:`Post-Training Quantization <../nlp/quantization>` workflow that converts regular ``.nemo`` models into a `TensorRT-LLM checkpoint <https://nvidia.github.io/TensorRT-LLM/architecture/checkpoint.html>`_, conventionally referred to in NeMo as a ``.qnemo`` checkpoint. Such a checkpoint can be used with the `NVIDIA TensorRT-LLM library <https://nvidia.github.io/TensorRT-LLM/index.html>`_ for efficient inference.

Like a ``.nemo`` checkpoint, a ``.qnemo`` checkpoint is a tar file. It bundles the model configuration, given in the ``config.json`` file, with ``rank{i}.safetensors`` files that store the model weights for each rank separately. Additionally, a ``tokenizer_config.yaml`` file is saved; it is just the ``tokenizer`` section of the ``model_config.yaml`` file from the original NeMo model and defines the tokenizer for the given model.
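
The layout described above can be checked with the Python standard library alone. The following is a minimal sketch, not part of NeMo or TensorRT-LLM: ``summarize_qnemo`` and ``build_toy_archive`` are hypothetical helper names, and the sketch assumes the files sit either at the top level of the archive or inside a single folder. The toy archive holds empty placeholder files, not real weights.

```python
import io
import re
import tarfile

RANK_FILE = re.compile(r"rank(\d+)\.safetensors")


def summarize_qnemo(fileobj):
    """Group the members of a .qnemo-style tar archive by role (hypothetical helper)."""
    summary = {"config": None, "rank_weights": [], "tokenizer_config": None, "other": []}
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            # Compare on the base name so a leading folder does not matter.
            name = member.name.rsplit("/", 1)[-1]
            if name == "config.json":
                summary["config"] = member.name
            elif RANK_FILE.fullmatch(name):
                summary["rank_weights"].append(member.name)
            elif name == "tokenizer_config.yaml":
                summary["tokenizer_config"] = member.name
            else:
                summary["other"].append(member.name)
    # Order the weight files by rank index rather than by string order.
    summary["rank_weights"].sort(
        key=lambda n: int(RANK_FILE.search(n).group(1))
    )
    return summary


def build_toy_archive(file_names):
    """Build an in-memory tar with empty placeholder files (demo only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name in file_names:
            info = tarfile.TarInfo(name)
            info.size = 0
            archive.addfile(info, io.BytesIO(b""))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    toy = build_toy_archive(
        ["config.json", "rank0.safetensors", "rank1.safetensors",
         "tokenizer.model", "tokenizer_config.yaml"]
    )
    print(summarize_qnemo(toy))
```

For a real checkpoint, the same function could be called on a file opened in binary mode, provided the archive follows the layout assumed here.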
Collaborator


Does .qnemo not support the distributed checkpoint format? That is, if you saved with world_size 2, do you have to load with world_size 2?

Collaborator Author


One thing to clarify: this config.json + rank{i}.safetensors output is a TRT-LLM checkpoint. It should not be confused with a distributed checkpoint in the NeMo sense.

Anyway, the feature you asked for is not currently available in TRT-LLM. To build a TRT-LLM engine with world_size=2, one needs to calibrate/quantize the model to a TRT-LLM checkpoint with world_size=2 and provide it as the input to the trtllm-build command. In other words, world_size cannot be changed at engine build.
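
To illustrate the constraint, here is a small sketch of a pre-build sanity check. It is not a TRT-LLM API: ``checkpoint_world_size`` and ``check_build_world_size`` are hypothetical helpers, and the sketch assumes that the checkpoint contains exactly one rank{i}.safetensors file per rank, numbered from 0, so that the file count equals the world_size used at quantization time.

```python
import re

RANK_FILE = re.compile(r"rank(\d+)\.safetensors")


def checkpoint_world_size(file_names):
    """Infer world_size from the rank{i}.safetensors files (assumes one file per rank)."""
    ranks = sorted(
        int(match.group(1))
        for name in file_names
        if (match := RANK_FILE.fullmatch(name))
    )
    # Ranks are expected to be contiguous: 0, 1, ..., world_size - 1.
    if ranks != list(range(len(ranks))):
        raise ValueError(f"rank files are not contiguous from 0: {ranks}")
    return len(ranks)


def check_build_world_size(file_names, requested_world_size):
    """Fail early if the requested world_size differs from the checkpoint's."""
    found = checkpoint_world_size(file_names)
    if found != requested_world_size:
        raise ValueError(
            f"checkpoint was produced with world_size={found}, "
            f"but world_size={requested_world_size} was requested; "
            "re-run calibration/quantization with the desired world_size"
        )
    return found


if __name__ == "__main__":
    files = ["config.json", "rank0.safetensors", "rank1.safetensors"]
    print(check_build_world_size(files, 2))
```

A check of this kind would run before the engine build, so that a mismatch surfaces as a clear error instead of a failed build.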

@@ -20,6 +20,26 @@ With sharded model weights, you can save and load the state of your training scr

NeMo supports the distributed (sharded) checkpoint format from Megatron-Core. In Megatron-Core, it supports two backends: Zarr-based and PyTorch-based.
Collaborator

@jgerh jgerh Jun 25, 2024


Edits for 1 - 21.

Checkpoints

This section presents the key functionalities of NVIDIA NeMo that pertain to checkpoint management.

Understand Checkpoint Formats

A .nemo checkpoint is essentially a tar file that combines various components of a trained model. These components include the model configurations (specified in a YAML file), the model weights, and other related artifacts such as tokenizer models or vocabulary files. This design simplifies tasks like sharing, loading, tuning, evaluating, and performing inference with the model.

On the other hand, the .ckpt file, generated during PyTorch Lightning training, contains both the model weights and the optimizer states. It is typically used to resume training from a paused state.

Sharded Model Weights

In both .nemo and .ckpt checkpoints, the model weights can be saved in either a regular format (as a single file named model_weights.ckpt within model parallelism folders) or a sharded format (where they are stored in a folder called model_weights).

Sharded model weights allow you to efficiently save and load the state of your training script across multiple GPUs or nodes. This approach avoids the necessity to modify model partitions when resuming tuning with a different model parallelism setup.

NeMo supports the distributed (sharded) checkpoint format from Megatron Core. In Megatron Core, there are two supported backends: Zarr-based and PyTorch-based.
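
As an illustration of the two layouts described above, the following sketch tells them apart in an already extracted checkpoint directory. It is only a sketch under stated assumptions: ``detect_weights_format`` is a hypothetical helper, not a NeMo function, and the folder name ``mp_rank_00`` in the demo is an assumed example of a model parallelism folder.

```python
import tempfile
from pathlib import Path


def detect_weights_format(checkpoint_dir):
    """Classify an extracted checkpoint directory (hypothetical helper).

    Returns "sharded" if a folder named model_weights exists, "regular" if a
    file named model_weights.ckpt exists, and "unknown" otherwise.
    """
    root = Path(checkpoint_dir)
    if any(path.is_dir() for path in root.rglob("model_weights")):
        return "sharded"
    if any(path.is_file() for path in root.rglob("model_weights.ckpt")):
        return "regular"
    return "unknown"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Regular format: one file inside a model parallelism folder.
        regular = Path(tmp) / "regular" / "mp_rank_00"  # assumed folder name
        regular.mkdir(parents=True)
        (regular / "model_weights.ckpt").write_bytes(b"")
        # Sharded format: a folder called model_weights.
        (Path(tmp) / "sharded" / "model_weights").mkdir(parents=True)

        print(detect_weights_format(Path(tmp) / "regular"))
        print(detect_weights_format(Path(tmp) / "sharded"))
```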

Collaborator Author


@yaoyu-33 that would be something for you to account for in the destination branch yuya/add_checkpoints_section

├── rank1.safetensors
├── tokenizer.model
└── tokenizer_config.yaml

Community Checkpoint Converter
Collaborator

@jgerh jgerh Jun 25, 2024


Edits to 45-47

NVIDIA provides easy-to-use tools that enable users to convert community checkpoints into the NeMo format. These tools facilitate various operations, including resuming training, Supervised Fine-tuning (SFT), Parameter Efficient Fine-Tuning (PEFT), and deployment. Please consult our documentation for detailed instructions and guidelines. We provide comprehensive guides to assist both end users and developers.

Collaborator Author


@yaoyu-33 this is also something for you to address here #9329, please have a look

@jgerh jgerh mentioned this pull request Jun 26, 2024
8 tasks
Signed-off-by: Jan Lasek <[email protected]>
@yaoyu-33 yaoyu-33 merged commit ae1c806 into yuya/add_checkpoints_section Jun 27, 2024
10 checks passed
@yaoyu-33 yaoyu-33 deleted the jlasek/add_checkpoints_section_qnemo branch June 27, 2024 16:32
ericharper pushed a commit that referenced this pull request Jul 17, 2024
* Add checkpoints section

Signed-off-by: yaoyu-33 <[email protected]>

* Fix title

Signed-off-by: yaoyu-33 <[email protected]>

* update

Signed-off-by: yaoyu-33 <[email protected]>

* Add section on ".qnemo" checkpoints (#9503)

* Add 'Quantized Checkpoints' section

Signed-off-by: Jan Lasek <[email protected]>

* Address review comments

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>

* Distributed checkpointing user guide (#9494)

* Describe shardings and entrypoints

Signed-off-by: Mikołaj Błaż <[email protected]>

* Strategies, optimizers, finalize entrypoints

Signed-off-by: Mikołaj Błaż <[email protected]>

* Transformations

Signed-off-by: Mikołaj Błaż <[email protected]>

* Integration

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add link from intro

Signed-off-by: Mikołaj Błaż <[email protected]>

* Apply grammar suggestions

Signed-off-by: Mikołaj Błaż <[email protected]>

* Explain the example

Signed-off-by: Mikołaj Błaż <[email protected]>

* Apply review suggestions

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add zarr and torch_dist explanation

---------

Signed-off-by: Mikołaj Błaż <[email protected]>

* add subsection

Signed-off-by: yaoyu-33 <[email protected]>

* Update docs/source/checkpoints/intro.rst

Co-authored-by: Chen Cui <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* address comments

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix code block

Signed-off-by: yaoyu-33 <[email protected]>

* address comments

Signed-off-by: yaoyu-33 <[email protected]>

* formatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Yu Yao <[email protected]>
Co-authored-by: Jan Lasek <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
ertkonuk pushed a commit that referenced this pull request Jul 19, 2024
Signed-off-by: Tugrul Konuk <[email protected]>
tonyjie pushed a commit to tonyjie/NeMo that referenced this pull request Jul 24, 2024
akoumpa pushed a commit that referenced this pull request Jul 25, 2024
Signed-off-by: Alexandros Koumparoulis <[email protected]>
malay-nagda pushed a commit to malay-nagda/NeMo that referenced this pull request Jul 26, 2024
Signed-off-by: Malay Nagda <[email protected]>
monica-sekoyan pushed a commit that referenced this pull request Oct 14, 2024
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 5, 2024
Signed-off-by: Hainan Xu <[email protected]>
XuesongYang pushed a commit to paarthneekhara/NeMo that referenced this pull request Jan 18, 2025
3 participants