NeMo performance feature documentation #9482

Merged (10 commits) on Jul 11, 2024

37 changes: 37 additions & 0 deletions docs/source/features/activation_recomputation.rst
@@ -0,0 +1,37 @@
Activation Recomputation
========================

The input activations of network layers are stored in device memory to compute the gradients during back-propagation.
These stored input activations can easily saturate device memory when training an LLM with a large sequence length or a large micro-batch size.
Checkpointing a few activations and recomputing the rest is a common technique to reduce device memory usage.

Transformer Layer Recomputation
-------------------------------

NeMo supports Transformer layer recomputation, which checkpoints the input of each Transformer layer and recomputes the activations of the remaining layers.
Transformer layer recomputation significantly reduces the activation memory usage.
However, this approach increases the per-Transformer-layer computation cost by 30%, which comes from re-executing the entire layer's forward computation.
NeMo also supports partial Transformer layer recomputation, which is beneficial when recomputing only a few Transformer layers is enough to fit the training workload in GPU memory, thereby avoiding the cost of recomputing the remaining layers.

Transformer layer recomputation is enabled by setting ``activations_checkpoint_granularity=full``.
The number of Transformer layers to recompute can be set using ``activations_checkpoint_num_layers`` along with ``activations_checkpoint_method=block``.
If ``activations_checkpoint_num_layers`` is set to the total number of layers, the inputs of all Transformer layers are checkpointed and recomputed.
When training with pipeline parallelism, ``activations_checkpoint_num_layers`` indicates the number of layers per pipeline stage.
When virtual pipelining is used, ``activations_checkpoint_num_layers`` indicates the number of layers per virtual pipeline stage.
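For example, a minimal sketch of the relevant entries, assuming they sit under the ``model`` section of a NeMo training YAML as in the NeMo-Framework-Launcher GPT configs (the layer count of 4 is illustrative):

.. code-block:: yaml

   model:
     # Checkpoint the input of each Transformer layer and recompute it in back-propagation.
     activations_checkpoint_granularity: full
     # Recompute a fixed number of layers per (virtual) pipeline stage.
     activations_checkpoint_method: block
     activations_checkpoint_num_layers: 4  # illustrative; tune to fit GPU memory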

NeMo also supports checkpointing the input to a block of multiple consecutive Transformer layers, meaning that a block of Transformer layers becomes the recomputation granularity.
This can further save activation memory at the cost of a larger recomputation buffer, so it only yields memory savings when the model has many Transformer layers or when the intermediate layers of a Transformer layer hold relatively small activation stores.
This recomputation mode can be enabled by setting ``activations_checkpoint_method=uniform``, and the number of Transformer layers per recomputation block is set using ``activations_checkpoint_num_layers``.
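A corresponding sketch for the uniform mode, under the same assumptions (the block size of 2 is illustrative):

.. code-block:: yaml

   model:
     activations_checkpoint_granularity: full
     # Checkpoint the input of every block of 2 consecutive Transformer layers.
     activations_checkpoint_method: uniform
     activations_checkpoint_num_layers: 2  # illustrative block size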

Self-attention Recomputation
----------------------------

NeMo supports self-attention recomputation, which checkpoints the inputs of each self-attention block and recomputes the intermediate activations.
This is a cost-efficient recomputation method that achieves high memory savings at low recomputation cost.
The intermediate layers of the self-attention block account for the majority of the activation memory.
This is because the input sizes of the softmax, dropout, and QKV dot-product attention layers scale quadratically with the sequence length.
However, their recomputation cost is relatively small compared to that of the linear projection layers, which scales quadratically with the hidden size.

Self-attention recomputation is always enabled when using FlashAttention, which is supported in Transformer Engine.
A user can also enable self-attention recomputation without FlashAttention by setting ``activations_checkpoint_granularity=selective``.
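A minimal sketch of the selective-recomputation setting, again assuming the ``model`` section of a NeMo training YAML:

.. code-block:: yaml

   model:
     # Recompute only the memory-heavy self-attention activations.
     activations_checkpoint_granularity: selective
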
63 changes: 63 additions & 0 deletions docs/source/features/communication_overlap.rst
@@ -0,0 +1,63 @@
Communication Overlap
=====================

Data-parallel communication overlap
-----------------------------------

NeMo supports overlapping the data-parallel (DP) communications with computation during LLM training.
NeMo features a Distributed Optimizer that distributes the optimizer states and the high-precision master parameters across GPUs, which introduces two types of data-parallel communication: reduce-scatter of gradients and all-gather of updated parameters.
The DP communication is chunked at the granularity of a Transformer layer, and each communication chunk is overlapped with computation.
This overlap method exposes only one DP communication chunk, ensuring efficient large-scale LLM training.
When training with pipeline parallelism, the granularity of DP communication becomes the Transformer layers per virtual pipeline stage.

DP gradient reduce-scatter and parameter all-gather overlaps are enabled by setting ``overlap_grad_sync=true`` and ``overlap_param_sync=true``, respectively.
The precision of the gradient reduce-scatter is set by ``grad_sync_dtype``; reducing in bf16 improves performance at large training scale compared to the default precision of fp32.
When training in fp8 compute precision (with ``fp8=true``), setting ``fp8_params=true`` conducts the parameter all-gather in fp8, reducing the all-gather overhead by half.
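A minimal sketch of these settings, assuming the ``model`` and ``model.optim`` sections of a NeMo training YAML as in the NeMo-Framework-Launcher configs:

.. code-block:: yaml

   model:
     fp8: true          # fp8 compute precision via Transformer Engine
     fp8_params: true   # all-gather the parameters in fp8
     optim:
       overlap_grad_sync: true    # overlap gradient reduce-scatter with computation
       overlap_param_sync: true   # overlap parameter all-gather with computation
       grad_sync_dtype: bf16      # reduce gradients in bf16 instead of the default fp32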

Tensor-parallel communication overlap
-------------------------------------

Tensor parallelism, when used with sequence-parallel activation sharding (``sequence_parallel=true``), introduces activation (gradient) all-gather and reduce-scatter operations, as shown in the figure below.
NeMo provides various options to overlap the tensor-parallel (TP) communications with computation.
TP communications without a direct computation dependency are overlapped with computation in bulk (the linear layer and TP communication pairs in the yellow boxes).
The bulk TP communication overlap is enabled by default.
The other TP communications, which have a direct computation dependency, are overlapped in a pipelined fashion (the linear layer and TP communication pairs in the red boxes): the TP communication and computation are chunked, and the chunks are overlapped in a pipeline.
In the pipelined overlap, the activation (gradient) all-gather is replaced with multiple steps of input P2P ring exchanges, and the reduce-scatter is replaced with multiple steps of GEMM output P2P ring exchanges followed by a reduction of the received outputs.
For the reduce-scatter overlap, NeMo also provides the option of pipelined overlap using reduce-scatter chunks, which exposes only one reduce-scatter chunk.

.. image:: ../nlp/nemo_megatron/images/tp_comm_overlap.png
:align: center
:width: 600px
:alt: Tensor-parallel communication overlap

The pipelined TP communication overlap is implemented in Transformer Engine and is enabled by setting ``ub_tp_comm_overlap=true``.
The specific overlap methods can be set by a config dictionary, which is passed as a YAML file.
The individual bulk, pipelined all-gather, and reduce-scatter overlaps can be enabled or disabled with ``tp_comm_bulk_wgrad``, ``tp_comm_bulk_dgrad``, ``tp_comm_overlap_ag``, and ``tp_comm_overlap_rs``, respectively.
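A minimal sketch of the TP-overlap switches, assuming these flags are exposed under the ``model`` section of a NeMo training YAML; the fine-grained per-GEMM overlap methods would go in the separate config dictionary mentioned above:

.. code-block:: yaml

   model:
     sequence_parallel: true    # sequence-parallel activation sharding
     ub_tp_comm_overlap: true   # pipelined TP communication overlap (Transformer Engine)
     tp_comm_bulk_wgrad: true   # bulk overlap of the wgrad-side TP communication
     tp_comm_bulk_dgrad: true   # bulk overlap of the dgrad-side TP communication
     tp_comm_overlap_ag: true   # pipelined all-gather overlap
     tp_comm_overlap_rs: true   # pipelined reduce-scatter overlap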

Pipeline-parallel communication overlap
---------------------------------------

Pipelining introduces P2P activation (gradient) sends and receives between pipeline-parallel (PP) GPUs.
The PP communication frequency increases with the virtual-pipeline-parallel size because the number of Transformer layers executed per micro-batch decreases.
This increased PP communication overhead can cancel out the pipeline-bubble reduction obtained from virtual pipelining.
NeMo supports overlapping the PP communications with non-dependent computations in the 1F1B phase (the steady-state body of pipelining, where one forward and one backward micro-batch execution are interleaved).
The PP communications in the pipeline fill and flush phases remain exposed.

.. image:: ../nlp/nemo_megatron/images/pp_comm_overlap.png
:align: center
:width: 600px
:alt: Pipeline-parallel communication overlap in 1F1B pipelining phase

The PP communication overlap is enabled by setting ``overlap_p2p_comm=true``. Additionally, setting ``batch_p2p_comm=false`` uses separate kernels for the send and the receive, which further improves communication efficiency and GPU resource utilization.
NeMo supports PP communication overlap only with virtual pipelining, where PP communication becomes the performance bottleneck.
Please refer to the `GPT3 training config file <https://github.com/NVIDIA/NeMo-Framework-Launcher/blob/main/launcher_scripts/conf/training/gpt3/175b.yaml>`_, which uses the PP communication overlap.
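A minimal sketch of these settings, assuming the ``model`` section of a NeMo training YAML (the parallelism sizes are illustrative):

.. code-block:: yaml

   model:
     pipeline_model_parallel_size: 4           # illustrative
     virtual_pipeline_model_parallel_size: 6   # illustrative; PP overlap requires virtual pipelining
     overlap_p2p_comm: true                    # overlap P2P sends/receives with 1F1B computation
     batch_p2p_comm: false                     # separate send and receive kernels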

Context-parallel communication overlap
--------------------------------------

Context parallelism partitions the activations (gradients) of all layers in the sequence domain, which introduces all-gather and reduce-scatter of activations (gradients) in the self-attention forward and backward passes.
NeMo hides the context-parallel (CP) communications under the self-attention computation.
Like the TP communication overlaps, the CP communications are chunked and then pipeline-overlapped with the self-attention computation, where the all-gather and reduce-scatter of activations (gradients) are replaced with P2P ring exchanges of data.

The CP communication overlap is enabled by default when context parallelism is used (``context_parallel_size > 1``).
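A minimal sketch, assuming the ``model`` section of a NeMo training YAML (the CP size of 2 is illustrative):

.. code-block:: yaml

   model:
     context_parallel_size: 2  # >1 enables context parallelism and its communication overlap
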
16 changes: 13 additions & 3 deletions docs/source/features/mixed_precision.rst
@@ -3,11 +3,21 @@
Mixed Precision Training
------------------------

Mixed precision training significantly enhances computational efficiency by conducting operations in low-precision formats, while selectively maintaining minimal data in single precision to preserve critical information throughout key areas of the network. NeMo now supports FP16, BF16, and FP8 (via Transformer Engine) across most models. Further details will be provided shortly.


Half-precision Training
=======================

NeMo supports half-precision (FP16 and BF16) training via Megatron Core and the distributed optimizer.
This training recipe uses half precision for all layer computations while keeping the model states (optimizer states and master parameters) in single precision.
To avoid repeated data-type casting at each layer computation, Megatron Core keeps a separate copy of the half-precision parameters that is updated after each ``optimizer.step``.

Half-precision training is enabled by setting ``precision`` to either ``fp16-mixed`` or ``bf16-mixed`` along with ``megatron_amp_O2=true``.
The parameter gradients are computed in the same half precision, and the precision of the gradient reduce-scatter across data-parallel GPUs can be set by ``optim.grad_sync_dtype``.
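A minimal sketch of these settings, assuming a NeMo training YAML with ``precision`` under the ``trainer`` section and the remaining keys under ``model``, as in the NeMo-Framework-Launcher configs:

.. code-block:: yaml

   trainer:
     precision: bf16-mixed    # or fp16-mixed
   model:
     megatron_amp_O2: true    # keep a half-precision parameter copy updated after optimizer.step
     optim:
       grad_sync_dtype: bf16  # precision of the data-parallel gradient reduce-scatter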

FP8 Training
============

Overview
^^^^^^^^