NeMo performance feature documentation (NVIDIA#9482)
Signed-off-by: Malay Nagda <[email protected]>
Showing 6 changed files with 211 additions and 64 deletions.
Activation Recomputation
========================

The input activations of network layers are stored in device memory to compute the gradients during back-propagation.
These stored activations can easily saturate device memory when training an LLM with a large sequence length or a large micro-batch size.
Checkpointing a few activations and recomputing the rest is a common technique to reduce the device memory requirement.

Transformer Layer Recomputation
-------------------------------

NeMo supports Transformer layer recomputation, which checkpoints the input of each Transformer layer and recomputes all of the remaining activations within the layer.
Transformer layer recomputation significantly reduces activation memory usage.
However, this approach increases the per-layer computation cost by 30%, which comes from re-executing the entire forward computation of the layer.
NeMo also supports partial Transformer layer recomputation, which is beneficial when recomputing only a few Transformer layers is enough to fit the training workload in GPU memory.
This avoids recomputing the remaining layers.

Transformer layer recomputation is enabled by setting ``activations_checkpoint_granularity=full``.
The number of Transformer layers to recompute can be set using ``activations_checkpoint_num_layers`` along with ``activations_checkpoint_method=block``.
If ``activations_checkpoint_num_layers`` is set to the total number of layers, the inputs of all Transformer layers are checkpointed and recomputed.
When training with pipeline parallelism, ``activations_checkpoint_num_layers`` indicates the number of layers per pipeline stage.
When virtual pipelining is used, it indicates the number of layers per virtual pipeline stage.

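For example, a minimal config sketch of block-wise recomputation (the flag names are as above; the ``model`` nesting and the layer count are assumptions for illustration):

.. code-block:: yaml

   model:
     activations_checkpoint_granularity: full  # checkpoint at Transformer-layer granularity
     activations_checkpoint_method: block      # recompute a fixed number of layers
     activations_checkpoint_num_layers: 4      # e.g., 4 layers per (virtual) pipeline stage
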
NeMo also supports checkpointing the input to a block of multiple consecutive Transformer layers, so that a block of Transformer layers becomes the recomputation granularity.
This can further save activation memory at the cost of a larger recomputation buffer.
It is therefore beneficial for memory savings only when the model has many Transformer layers or when the intermediate layers of each Transformer layer hold relatively small activation stores.
This recomputation mode is enabled by setting ``activations_checkpoint_method=uniform``, with the number of Transformer layers per recomputation block set by ``activations_checkpoint_num_layers``.

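A corresponding sketch for the uniform mode (again, the ``model`` nesting and the block size are illustrative assumptions):

.. code-block:: yaml

   model:
     activations_checkpoint_granularity: full
     activations_checkpoint_method: uniform   # blocks of consecutive layers are the recomputation unit
     activations_checkpoint_num_layers: 2     # e.g., 2 Transformer layers per recomputation block
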
Self-attention Recomputation
----------------------------

NeMo supports self-attention recomputation, which checkpoints the inputs of each self-attention block and recomputes the intermediate activations.
This is a cost-efficient recomputation method that achieves high memory savings at low recomputation cost.
The intermediate layers of the self-attention block account for the majority of the activation memory.
This is because the input sizes of the softmax, dropout, and QKV dot-product attention layers grow with the square of the sequence length.
Their recomputation cost, however, is relatively small compared to that of the linear projection layers, which scales with the square of the hidden size.

Self-attention recomputation is always enabled when using FlashAttention, which is supported in Transformer Engine.
A user can also enable self-attention recomputation without FlashAttention by setting ``activations_checkpoint_granularity=selective``.
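
A minimal sketch of the selective mode (``model`` nesting assumed):

.. code-block:: yaml

   model:
     activations_checkpoint_granularity: selective  # recompute only the self-attention activations
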
Communication Overlap
=====================

Data-parallel Communication Overlap
-----------------------------------

NeMo supports overlapping the data-parallel (DP) communications with computation in LLM training.
NeMo features a distributed optimizer that distributes the optimizer states and the high-precision master parameters across GPUs. This introduces two types of data-parallel communication: reduce-scatter of gradients and all-gather of updated parameters.
The DP communication is chunked at the granularity of a Transformer layer, and each communication chunk is overlapped with computation.
This overlap method exposes only a single DP communication chunk, ensuring efficient large-scale LLM training.
When training with pipeline parallelism, the granularity of DP communication becomes the Transformer layers per virtual pipeline stage.

DP gradient reduce-scatter and parameter all-gather overlaps are enabled by setting ``overlap_grad_sync=true`` and ``overlap_param_sync=true``, respectively.
The precision of the gradient reduce-scatter is set by ``grad_sync_dtype``; reducing in bf16 improves performance at large training scale compared to the default fp32 precision.
When training with fp8 compute precision (``fp8=true``), setting ``fp8_params=true`` performs the parameter all-gather in fp8, halving the all-gather overhead.

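A hypothetical config excerpt combining these flags (the ``optim`` nesting and the ``distributed_fused_adam`` optimizer name are assumptions based on common NeMo recipes; the flags themselves are taken from the text):

.. code-block:: yaml

   model:
     fp8: true                       # train with fp8 compute precision
     fp8_params: true                # all-gather updated parameters in fp8
     optim:
       name: distributed_fused_adam  # distributed optimizer (assumed)
       overlap_grad_sync: true       # overlap gradient reduce-scatter
       overlap_param_sync: true      # overlap parameter all-gather
       grad_sync_dtype: bf16         # reduce gradients in bf16
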
Tensor-parallel Communication Overlap
-------------------------------------

Tensor parallelism, when used with sequence-parallel activation sharding (``sequence_parallel=true``), introduces activation (gradient) all-gather and reduce-scatter operations, as shown in the figure below.
NeMo provides various options to overlap the tensor-parallel (TP) communications with computation.
TP communications without a direct computation dependency are overlapped with the computation in bulk (the linear layer and TP communication pairs in the yellow boxes).
The bulk TP communication overlap is enabled by default.
TP communications with a direct computation dependency are overlapped in a pipelined fashion (the linear layer and TP communication pairs in the red boxes): the TP communication and computation are chunked, and the chunks are overlapped in a pipeline.
In the pipelined overlap, the activation (gradient) all-gather is replaced with multiple steps of input P2P ring exchanges, and the reduce-scatter is replaced with multiple steps of GEMM output P2P ring exchanges followed by a reduction of the received outputs.
For the reduce-scatter overlap, NeMo also provides the option to pipeline-overlap using chunks of reduce-scatter, which exposes only a single reduce-scatter chunk.

.. image:: ../nlp/nemo_megatron/images/tp_comm_overlap.png
   :align: center
   :width: 600px
   :alt: Tensor-parallel communication overlap

The pipelined TP communication overlap is implemented in Transformer Engine and is enabled by setting ``ub_tp_comm_overlap=true``.
The specific overlap methods can be set through a config dictionary, which is passed in as a yaml file.
The bulk, pipelined all-gather, and reduce-scatter overlaps can be individually enabled or disabled via ``tp_comm_bulk_wgrad``, ``tp_comm_bulk_dgrad``, ``tp_comm_overlap_ag``, and ``tp_comm_overlap_rs``, respectively.

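A sketch of such an overlap-method dictionary (the ``ub_tp_comm_overlap_cfg`` key name is an assumption; only the four per-method flags come from the text):

.. code-block:: yaml

   model:
     sequence_parallel: true     # TP overlap builds on sequence-parallel sharding
     ub_tp_comm_overlap: true    # enable pipelined TP communication overlap
     ub_tp_comm_overlap_cfg:     # assumed key; holds the per-method overlap settings
       tp_comm_bulk_wgrad: true  # bulk overlap of wgrad-side communication
       tp_comm_bulk_dgrad: true  # bulk overlap of dgrad-side communication
       tp_comm_overlap_ag: true  # pipelined all-gather overlap
       tp_comm_overlap_rs: true  # pipelined reduce-scatter overlap
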
Pipeline-parallel Communication Overlap
---------------------------------------

Pipelining introduces P2P activation (gradient) sends and receives between pipeline-parallel (PP) GPUs.
The PP communication frequency increases as the virtual-pipeline-parallel size grows because the number of Transformer layers executed per micro-batch decreases.
This growing PP communication overhead cancels out the pipeline-bubble reduction gained from virtual pipelining.
NeMo supports overlapping the PP communications with non-dependent computations in the 1F1B stage (the body of pipelining, where one forward and one backward micro-batch execution are interleaved).
The PP communications in the pipeline fill and flush phases remain exposed.

.. image:: ../nlp/nemo_megatron/images/pp_comm_overlap.png
   :align: center
   :width: 600px
   :alt: Pipeline-parallel communication overlap in 1F1B pipelining phase

The PP communication overlap is enabled by setting ``overlap_p2p_comm=true``. Additionally, setting ``batch_p2p_comm=false`` uses separate kernels for the send and the receive, which further improves communication efficiency and GPU resource utilization.
NeMo supports PP communication overlap only with virtual pipelining, where PP communication becomes the performance bottleneck.
Please refer to the `GPT3 training config file <https://github.com/NVIDIA/NeMo-Framework-Launcher/blob/main/launcher_scripts/conf/training/gpt3/175b.yaml>`_, which uses the PP communication overlap.

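A minimal sketch (the parallelism sizes are illustrative, and the two ``*_model_parallel_size`` key names are assumptions; the text only states that virtual pipelining is required):

.. code-block:: yaml

   model:
     pipeline_model_parallel_size: 8          # illustrative PP size
     virtual_pipeline_model_parallel_size: 6  # PP overlap requires virtual pipelining
     overlap_p2p_comm: true                   # overlap P2P sends/receives in the 1F1B stage
     batch_p2p_comm: false                    # separate kernels for send and receive
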
Context-parallel Communication Overlap
--------------------------------------

Context parallelism partitions the activations (gradients) of all layers along the sequence dimension.
This introduces all-gather and reduce-scatter of activations (gradients) in the self-attention forward and backward propagation.
NeMo hides the context-parallel (CP) communications under the self-attention computation.
Like the TP communication overlap, the CP communications are chunked and then pipeline-overlapped with the self-attention computation, where the all-gather and reduce-scatter of activations (gradients) are replaced with P2P ring exchanges of data.

The CP communication overlap is enabled by default when context parallelism is used (``context_parallel_size > 1``).
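
For example (the size value is illustrative; no extra flag is needed for the overlap itself):

.. code-block:: yaml

   model:
     context_parallel_size: 2  # CP communication overlap is on by default when > 1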