diff --git a/docs/source/features/activation_recomputation.rst b/docs/source/features/activation_recomputation.rst
new file mode 100644
index 000000000000..5151c2aac48f
--- /dev/null
+++ b/docs/source/features/activation_recomputation.rst
@@ -0,0 +1,38 @@
+Activation Recomputation
+========================
+
+The input activations of network layers are stored in device memory to compute the gradients during back-propagation.
+These stored activations can easily saturate device memory when training an LLM with a large sequence length or a large micro-batch size.
+Checkpointing a few activations and recomputing the rest is a common technique to reduce device memory usage.
+
+Transformer Layer Recomputation
+-------------------------------
+
+NeMo supports Transformer layer recomputation, which checkpoints the input of each Transformer layer and recomputes all other activations of the layer during back-propagation.
+Transformer layer recomputation significantly reduces activation memory usage.
+However, this approach increases the per-layer computation cost by 30%, which comes from re-executing the entire forward computation of the layer.
+NeMo also supports partial Transformer layer recomputation, which is beneficial when recomputing only a few Transformer layers is enough to fit the training workload in GPU memory; this avoids recomputing the remaining layers.
+
+Transformer layer recomputation is enabled by setting ``activations_checkpoint_granularity=full``.
+The number of Transformer layers to recompute is set using ``activations_checkpoint_num_layers`` along with ``activations_checkpoint_method=block``.
+If ``activations_checkpoint_num_layers`` is set to the total number of layers, the inputs of all Transformer layers are checkpointed and recomputed.
+When training with pipeline parallelism, ``activations_checkpoint_num_layers`` indicates the number of layers per pipeline stage.
+When virtual pipelining is used, ``activations_checkpoint_num_layers`` indicates the number of layers per virtual pipeline stage.
+
+NeMo also supports checkpointing the input to a block of multiple consecutive Transformer layers, meaning that a block of Transformer layers becomes the recomputation granularity.
+This can further save activation memory at the cost of a larger recomputation buffer.
+Thus, it is only beneficial when the model has many Transformer layers or when the intermediate layers of a Transformer layer hold relatively small activation stores.
+This recomputation mode is enabled by setting ``activations_checkpoint_method=uniform``, and the number of Transformer layers per recomputation block is set using ``activations_checkpoint_num_layers``.
+
+Self-attention Recomputation
+----------------------------
+
+NeMo supports self-attention recomputation, which checkpoints the inputs of each self-attention block and recomputes the intermediate activations.
+This is a cost-efficient recomputation method that achieves high memory savings at low recomputation cost.
+The intermediate layers of the self-attention block account for the majority of the activation memory.
+This is because the input sizes of the softmax, dropout, and qkv dot-product attention layers scale with the square of the sequence length.
+However, their recomputation cost is relatively small compared to that of the linear projection layers, whose cost scales with the square of the hidden size.
+
+Self-attention recomputation is hard-enabled when using FlashAttention, which is supported in Transformer Engine.
+A user can also enable self-attention recomputation without FlashAttention by setting ``activations_checkpoint_granularity=selective``.
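+
+As a reference, a minimal configuration sketch is shown below (assuming these knobs sit under the ``model`` section of a NeMo megatron training config; the layer count is only illustrative):
+
+.. code-block:: yaml
+
+   model:
+     # Checkpoint the input of each Transformer layer and recompute the rest of the layer
+     activations_checkpoint_granularity: full
+     # "block" recomputes a set number of layers; "uniform" checkpoints blocks of consecutive layers
+     activations_checkpoint_method: block
+     # Illustrative value, counted per (virtual) pipeline stage when pipelining is used
+     activations_checkpoint_num_layers: 4
+     # Alternatively, recompute only the self-attention blocks:
+     # activations_checkpoint_granularity: selective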
diff --git a/docs/source/features/communication_overlap.rst b/docs/source/features/communication_overlap.rst
new file mode 100644
index 000000000000..605b2ba3d221
--- /dev/null
+++ b/docs/source/features/communication_overlap.rst
@@ -0,0 +1,63 @@
+Communication Overlap
+=====================
+
+Data-parallel Communication Overlap
+-----------------------------------
+
+NeMo supports overlapping data-parallel (DP) communication with computation in LLM training.
+NeMo features a distributed optimizer that shards the optimizer states and the high-precision master parameters across data-parallel GPUs.
+This introduces two types of DP communication: reduce-scatter of parameter gradients and all-gather of updated parameters.
+The DP communication is chunked at the granularity of a Transformer layer, and each communication chunk is overlapped with computation.
+This overlap method exposes only a single DP communication chunk, ensuring efficient large-scale LLM training.
+When training with pipeline parallelism, the granularity of DP communication becomes the set of Transformer layers per virtual pipeline stage.
+
+DP gradient reduce-scatter and parameter all-gather overlaps are enabled by setting ``overlap_grad_sync=true`` and ``overlap_param_sync=true``, respectively.
+The precision of the gradient reduce-scatter is set by ``grad_sync_dtype``; reducing in bf16 improves performance at large training scale compared to the default fp32 precision.
+When training with fp8 computation (``fp8=true``), setting ``fp8_params=true`` performs the parameter all-gather in fp8, halving the all-gather overhead.
+
+Tensor-parallel Communication Overlap
+-------------------------------------
+
+Tensor parallelism, when used with sequence-parallel activation sharding (``sequence_parallel=true``), introduces all-gather and reduce-scatter of activations (gradients), as shown in the figure below.
+NeMo provides various options to overlap these tensor-parallel (TP) communications with computation.
+The TP communications without a direct computation dependency are overlapped with computation in bulk (the linear-layer and TP-communication pairs in the yellow boxes).
+Bulk TP communication overlap is enabled by default.
+The TP communications with a direct computation dependency are overlapped in a pipelined fashion (the linear-layer and TP-communication pairs in the red boxes): the TP communication and the computation are chunked, and the chunks are overlapped in a pipeline.
+In the pipelined overlap, the activation (gradient) all-gather is replaced with multiple steps of input P2P ring exchanges, and the reduce-scatter is replaced with multiple steps of GEMM-output P2P ring exchanges followed by a reduction of the received outputs.
+For the reduce-scatter overlap, NeMo also provides the option to pipeline the overlap using chunked reduce-scatter, which exposes only one reduce-scatter chunk.
+
+.. image:: ../nlp/nemo_megatron/images/tp_comm_overlap.png
+   :align: center
+   :width: 600px
+   :alt: Tensor-parallel communication overlap
+
+The pipelined TP communication overlap is implemented in Transformer Engine and is enabled by setting ``ub_tp_comm_overlap=true``.
+The specific overlap methods can be set by a config dictionary, which is passed in as a yaml file.
+The individual overlaps can be enabled or disabled: ``tp_comm_bulk_wgrad`` and ``tp_comm_bulk_dgrad`` control the bulk overlaps, while ``tp_comm_overlap_ag`` and ``tp_comm_overlap_rs`` control the pipelined all-gather and reduce-scatter overlaps, respectively.
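+
+For reference, the sketch below illustrates the overlap knobs discussed in this section (the nesting assumes the DP-overlap knobs live under the distributed optimizer's ``optim`` block and the TP knobs directly under ``model``; exact placement may differ between recipes):
+
+.. code-block:: yaml
+
+   model:
+     optim:
+       name: distributed_fused_adam  # distributed optimizer, required for DP overlap
+       overlap_grad_sync: true       # overlap gradient reduce-scatter with back-propagation
+       overlap_param_sync: true      # overlap parameter all-gather with forward computation
+       grad_sync_dtype: bf16         # reduce gradients in bf16 instead of the default fp32
+     sequence_parallel: true         # required for the TP communication overlap
+     ub_tp_comm_overlap: true        # enable pipelined TP communication overlap (Transformer Engine)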
+
+Pipeline-parallel Communication Overlap
+---------------------------------------
+
+Pipelining introduces P2P activation (gradient) sends and receives between pipeline-parallel (PP) GPUs.
+The PP communication frequency increases with the virtual-pipeline-parallel size because the number of Transformer layers executed per micro-batch decreases.
+The increased PP communication overhead offsets the pipeline-bubble reduction achieved by virtual pipelining.
+NeMo supports overlapping the PP communications with non-dependent computation in the 1F1B stage (the body of the pipeline, where one forward and one backward micro-batch execution are interleaved).
+The PP communications in the pipeline fill and flush phases remain exposed.
+
+.. image:: ../nlp/nemo_megatron/images/pp_comm_overlap.png
+   :align: center
+   :width: 600px
+   :alt: Pipeline-parallel communication overlap in 1F1B pipelining phase
+
+The PP communication overlap is enabled by setting ``overlap_p2p_comm=true``. Also, setting ``batch_p2p_comm=false`` uses separate kernels for the send and the receive, which further improves communication efficiency and GPU resource utilization.
+NeMo supports PP communication overlap only with virtual pipelining, where PP communication becomes the performance bottleneck.
+Please refer to the `GPT3 training config file `_ that uses the PP communication overlap.
+
+Context-parallel Communication Overlap
+--------------------------------------
+
+Context parallelism partitions the activations (gradients) of all layers along the sequence dimension.
+This introduces all-gather and reduce-scatter of activations (gradients) in the self-attention forward and back-propagation.
+NeMo hides the context-parallel (CP) communications under the self-attention computation.
+Like the TP communication overlap, the CP communications are chunked and then pipeline-overlapped with the self-attention computation, where the all-gather and reduce-scatter of activations (gradients) are replaced with P2P ring exchanges of data.
+
+The CP communication overlap is enabled by default when context parallelism is used (``context_parallel_size > 1``).
diff --git a/docs/source/features/mixed_precision.rst b/docs/source/features/mixed_precision.rst
index ba0dfb4e945b..7e1e8c2f05fc 100644
--- a/docs/source/features/mixed_precision.rst
+++ b/docs/source/features/mixed_precision.rst
@@ -3,11 +3,21 @@
 Mixed Precision Training
 ------------------------
 
-Mixed precision training significantly enhances computational efficiency by conducting operations in half-precision and fp8 formats, while selectively maintaining minimal data in single-precision to preserve critical information throughout key areas of the network. NeMo now supports FP16, BF16, and FP8 (via Transformer Engine) across most models. Further details will be provided shortly.
+Mixed precision training significantly enhances computational efficiency by conducting operations in low-precision formats, while selectively maintaining minimal data in single-precision to preserve critical information throughout key areas of the network. NeMo now supports FP16, BF16, and FP8 (via Transformer Engine) across most models. Further details will be provided shortly.
 
-FP8 usage
-=========
+Half-precision Training
+=======================
+
+NeMo supports half-precision (FP16 and BF16) computation in training via Megatron Core and the distributed optimizer.
+This training recipe uses half-precision for all layer computations while keeping the model states (optimizer states and master parameters) in single-precision.
+To avoid repeated data type casting at each layer computation, Megatron Core keeps a separate copy of half-precision parameters that is updated after each optimizer step.
+
+Half-precision training is enabled by setting ``precision`` to either ``fp16-mixed`` or ``bf16-mixed`` along with ``megatron_amp_O2=true``.
+The parameter gradients are computed in the same half-precision, and the precision of the gradient reduce-scatter across data-parallel GPUs can be set by ``optim.grad_sync_dtype``.
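+
+As a reference, a minimal sketch of this recipe is shown below (assuming ``precision`` is set on the trainer and the other knobs under ``model``, as in typical NeMo megatron configs; exact nesting may differ between recipes):
+
+.. code-block:: yaml
+
+   trainer:
+     precision: bf16-mixed       # or fp16-mixed
+   model:
+     megatron_amp_O2: true       # keep half-precision parameter copies for layer computation
+     optim:
+       grad_sync_dtype: bf16     # precision of the gradient reduce-scatter across DP GPUs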
+
+FP8 Training
+============
 
 Overview
 ^^^^^^^^
diff --git a/docs/source/features/parallelisms.rst b/docs/source/features/parallelisms.rst
index 4cc493f40024..bf327fb18331 100644
--- a/docs/source/features/parallelisms.rst
+++ b/docs/source/features/parallelisms.rst
@@ -1,56 +1,48 @@
 .. _parallelisms:
 
 Parallelisms
-------------
+============
 
-NeMo Megatron supports five types of parallelism (which can be mixed together arbitrarily).
+NeMo Megatron supports various data-parallel and model-parallel deployment methods, which can be mixed together arbitrarily.
 
 Data Parallelism
-^^^^^^^^^^^^^^^^
+----------------
 
-Data Parallelism (DP) creates identical copies of the model across
-multiple GPUs. Data batches are distributed between GPUs so that the
-GPUs can process them independently. While compute is efficiently
-distributed between GPUs, communication is required in order to keep
-the model copies consistent with each other.
+Data Parallelism (DP) replicates the model across multiple GPUs.
+Data batches are evenly distributed between GPUs, and the data-parallel GPUs process them independently.
+While the computation workload is efficiently distributed across GPUs, inter-GPU communication is required to keep the model replicas consistent between training steps.
 
 Distributed Data Parallelism
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Distributed Data Parallelism (DDP) keeps model copies consistent by
-synchronizing parameter gradients before each optimization step. More
-specifically, it sums gradients over all model copies using an
-all-reduce communication collective.
+Distributed Data Parallelism (DDP) keeps the model copies consistent by synchronizing parameter gradients across data-parallel GPUs before each parameter update.
+More specifically, it sums the gradients of all model copies using all-reduce communication collectives.
 
 .. image:: ../nlp/nemo_megatron/images/ddp.gif
    :align: center
   :width: 800px
   :alt: Distributed Data Parallel
 
-Distributed Optimizer (ZeRO-1)
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Distributed Optimizer
+^^^^^^^^^^^^^^^^^^^^^
 
-The ZeRO-1 algorithm keeps model copies consistent by sharding the
-optimizer state between GPUs. During each optimization step, the
-parameter gradients are first summed and sharded (with a
-reduce-scatter collective), each GPU applies an optimization to its
-local shard of the parameters, and the updated parameter shards are
-broadcast to update all of the model copies (with an all-gather
-collective). This approach is attractive for large models since
-sharding the optimizer state can significantly reduce its memory
-footprint on individual GPUs. It also has, in theory, the same
-communication volume as DDP and its communication pattern has more
-opportunities for overlapping with compute.
+Distributed optimizer is a memory-optimized data-parallel deployment method.
+It shards the optimizer states and the high-precision master parameters across data-parallel GPUs instead of replicating them.
+At the parameter optimizer step, each data-parallel GPU updates its shard of the parameters.
+Since each GPU needs only its own gradient shard, the distributed optimizer performs a reduce-scatter of the parameter gradients instead of an all-reduce.
+The updated parameter shards are then all-gathered across data-parallel GPUs.
+This approach significantly reduces the memory needed for large-scale LLM training.
+Also, when the gradient precision is higher than the parameter precision, the split execution of gradient reduce-scatter and parameter all-gather reduces the total communication volume; for example, with fp32 gradients and bf16 parameters, a full fp32 all-reduce moves roughly eight bytes per parameter, while an fp32 reduce-scatter plus a bf16 all-gather moves roughly six.
+In addition, splitting the collective execution leaves more computation to overlap with the communication, which improves the overlap opportunity.
 
 Enable Data Parallelism
 ~~~~~~~~~~~~~~~~~~~~~~~
 
-DDP is the default parallelism scheme when NeMo is run on multiple
-GPUs. Enabling other parallelism schemes in the model configuration
-will decrease the size of the DP group, that is the number of
-identical model copies.
+In NeMo, DDP is the default parallel deployment method.
+This means that the total number of GPUs corresponds to the size of the DP group, and training an LLM with model parallelism decreases the size of the DP group.
 
-To enable the distributed optimizer, set
+Currently, NeMo supports optimizer distribution only for the Adam optimizer.
+To enable the distributed Adam optimizer, set
 ``model.optim.name=distributed_fused_adam`` in the model configuration.
 It can be configured with the following options:
@@ -80,10 +72,36 @@ The distributed optimizer in NeMo is built on top of
 `DistributedFusedAdam `_ from Apex.
 
+Fully-Sharded Data Parallelism (FSDP)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+NeMo supports Fully-Sharded Data Parallelism (FSDP), which shards the parameter gradients and the low-precision parameters used for computation, in addition to the model states sharded by the distributed optimizer (optimizer states and high-precision master parameters).
+Since FSDP shards the entire set of model states, it ensures linear model-state memory savings with increasing DP size.
+FSDP can be preferable for LLM training with an unbalanced workload between pipeline stages (or Transformer layers) or with a large vocabulary size, where pipelining would cause large computation bubbles due to the workload imbalance.
+Also, FSDP removes the effort of searching for performance-optimal mappings with 3D parallelism (TP/PP/DP) because it has a single parallelization domain.
+
+NeMo uses `PyTorch's FSDP interface `_ to shard LLM model states, which flattens the parameters of each Transformer layer and partitions them across data-parallel GPUs.
+FSDP introduces collectives across data-parallel GPUs: all-gather of the parameters for computation and reduce-scatter of the parameter gradients.
+The parameter all-gather occurs in both the forward- and back-propagation phases, while the gradient reduce-scatter happens only in back-propagation.
+These FSDP communications are overlapped with Transformer layer computations.
+
+Setting ``fsdp=true`` enables FSDP.
+The mixed precision recipe can be set via the ``precision`` knob, which determines both the computation and communication precisions.
+Also, one can use ``grad_reduce_dtype`` to override the precision of the gradient reduction specifically.
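+
+An illustrative sketch (assuming the FSDP knobs sit under the ``model`` section and the precision recipe on the trainer; key names follow this page and may differ between recipes):
+
+.. code-block:: yaml
+
+   trainer:
+     precision: bf16-mixed     # sets both the FSDP computation and communication precision
+   model:
+     fsdp: true                # shard gradients and low-precision parameters across DP GPUs
+     grad_reduce_dtype: fp32   # optional override of the gradient-reduction precision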
+
+
+Model Parallelism
+-----------------
+
+Model parallelism (MP) is a distributed model deployment method that partitions the model parameters across GPUs to reduce the per-GPU memory footprint.
+NeMo supports various model-parallel methods, which can be mixed to maximize LLM training performance.
+
 Tensor Parallelism
 ^^^^^^^^^^^^^^^^^^
 
-Tensor Parallelism (TP) is a method for distributing a model's computation across multiple GPUs by splitting tensors into non-overlapping pieces. This allows different parts of the tensor to be processed simultaneously on separate GPUs, enhancing performance and enabling the training of larger models.
+Tensor Parallelism (TP) is a model-parallel partitioning method that distributes the parameter tensor of an individual layer across GPUs.
+Besides reducing model state memory usage, it also saves activation memory because the per-GPU tensor sizes shrink.
+However, the smaller per-GPU tensors reduce the per-kernel workload sizes, which increases the relative CPU overhead.
 
 .. image:: ../nlp/nemo_megatron/images/tp.gif
    :align: center
 
@@ -112,6 +130,16 @@ NeMo integrates Tensor Parallelism through the implementation from Megatron Core
 For detailed API usage and additional configurations, consult the `Megatron Core Developer Guide `_.
 
+FSDP with Tensor Parallelism
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+NeMo supports using FSDP together with tensor parallelism. This is done by restricting the model state sharding to the data-parallel domain.
+Using FSDP with tensor parallelism can be helpful when the model does not have enough data parallelism to deploy on a large-scale training system, for example, running a model with a global batch size of 1024 on 2048 GPUs.
+Tensor parallelism also makes FSDP more feasible by reducing the model state size and the activation size per GPU, thus lowering both the FSDP communication overhead and the activation memory overhead.
+
+FSDP is combined with TP by enabling FSDP (``fsdp=true``) and setting ``tensor_model_parallel_size > 1``.
+The user should unset the ``CUDA_DEVICE_MAX_CONNECTIONS`` environment variable, which limits the number of GPU kernel queues, so that the FSDP communication can overlap with computation kernels.
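+
+An illustrative sketch of this combination (assuming both knobs sit under the ``model`` section; the TP size is only an example):
+
+.. code-block:: yaml
+
+   model:
+     fsdp: true                      # shard model states across the data-parallel domain only
+     tensor_model_parallel_size: 2   # illustrative TP size
+     # Leave CUDA_DEVICE_MAX_CONNECTIONS unset in the launch environment so that
+     # FSDP communication can overlap with computation kernels.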
 
 Pipeline Parallelism
 ^^^^^^^^^^^^^^^^^^^^
 
@@ -156,6 +184,40 @@ The NeMo implementation of PP leverages functionalities from Megatron Core. For
 For more detailed API usage and configurations related to PP, visit the `Megatron Core Developer Guide `_.
 
+Expert Parallelism
+^^^^^^^^^^^^^^^^^^
+Expert Parallelism (EP) is a type of model parallelism that distributes the experts of an MoE model across GPUs.
+Unlike other model-parallel techniques, EP is applied only to the expert layers and thus does not impact the parallel mapping of the remaining layers.
+
+.. image:: ../nlp/nemo_megatron/images/ep.png
+   :align: center
+   :width: 800px
+   :alt: Expert Parallelism
+
+Enable Expert Parallelism
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To enable EP, set ``model.expert_model_parallel_size`` to the desired expert parallel size. For example, if the model has six experts (``model.num_moe_experts=6``), then setting ``model.expert_model_parallel_size=3`` results in each GPU processing two experts. The number of experts should be divisible by the expert parallel size.
+
+   .. code-block:: yaml
+
+      expert_model_parallel_size: 3 # Set EP to 3
+
+For further information on configuration, refer to the following documentation: `NeMo Megatron GPT Config `_.
+
+
+Implement Expert Parallelism
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The NeMo implementation of Expert Parallelism uses functionality from Megatron Core. Please consult the `Megatron Core MoE layer `_ for more MoE implementation details.
+
+
+Activation Partitioning
+-----------------------
+
+In LLM training, a large memory space is needed to store the input activations of the network layers.
+NeMo provides effective activation distribution methods, which are critical for training an LLM with a large sequence length or a large per-GPU micro-batch size.
+
 Sequence Parallelism
 ^^^^^^^^^^^^^^^^^^^^
 
@@ -185,7 +247,8 @@ The NeMo implementation of Sequence Parallelism utilizes functionality from Mega
 Context Parallelism
 ^^^^^^^^^^^^^^^^^^^
 
-Context Parallelism (CP) is a method for parallelizing the processing of neural network activations across multiple GPUs, focusing on the sequence dimension of the input data. Unlike Sequence Parallelism (SP) that only partitions specific types of activations, CP divides all network activations along the sequence dimension.
+Context Parallelism (CP) is a method for parallelizing the processing of neural network activations across multiple GPUs, partitioning the input tensors along the sequence dimension.
+Unlike Sequence Parallelism (SP), which partitions the activations of specific layers, CP divides the activations of all layers.
 
 Enable Context Parallelism
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -212,34 +275,7 @@ Visit our source code for more insights into the implementation:
 - `Transformer Engine attention modules `_
 
-Expert Parallelism
-^^^^^^^^^^^^^^^^^^
-Expert Parallelism (EP) is a type of model parallelism that distributes experts of an MoE across GPUs.
-
-.. image:: ../nlp/nemo_megatron/images/ep.png
-   :align: center
-   :width: 800px
-   :alt: Expert Parallelism
-
-Enable Expert Parallelism
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-To enable EP, set ``model.expert_model_parallel_size`` to the desired expert parallel size. For example, if the model has six experts (``model.num_moe_experts=6``), then setting ``model.expert_model_parallel_size=3`` results in each GPU processing two experts. The number of experts should be divisible by the expert parallel size.
-
-   .. code-block:: yaml
-
-      expert_model_parallel_size: 3 # Set EP to 3
-
-For further information on configuration, refer to the following documentation: `NeMo Megatron GPT Config `_.
-
-
-Implement Expert Parallelism
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The NeMo implementation of Expert Parallelism uses functionality from Megatron Core. Please consult the `Megatron Core MoE layer `_ for more MoE implementation details.
-
-
-Parallelism nomenclature
+Parallelism Nomenclature
 ^^^^^^^^^^^^^^^^^^^^^^^^
 
 The following figure illustrates some terms that you may encounter in the NeMo Megatron codebase.
diff --git a/docs/source/nlp/nemo_megatron/images/pp_comm_overlap.png b/docs/source/nlp/nemo_megatron/images/pp_comm_overlap.png
new file mode 100644
index 000000000000..efaaf8f7274f
Binary files /dev/null and b/docs/source/nlp/nemo_megatron/images/pp_comm_overlap.png differ
diff --git a/docs/source/nlp/nemo_megatron/images/tp_comm_overlap.png b/docs/source/nlp/nemo_megatron/images/tp_comm_overlap.png
new file mode 100644
index 000000000000..4b44b20a343d
Binary files /dev/null and b/docs/source/nlp/nemo_megatron/images/tp_comm_overlap.png differ