Multi GPU Support

Compiling for Multi-GPU

To enable multi-GPU support, set either QUDA_MPI=on or QUDA_QMP=on to select the MPI or QMP communications back end, respectively. QMP is the USQCD QCD communications layer, which provides compatibility with other USQCD software packages. To enable QUDA to use QIO directly, you need to enable QMP. CMake should detect the MPI compiler and libraries by default, but these can be set manually if needed (use ccmake to configure, then switch to advanced options using 't' to set all the paths by hand).
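As a minimal sketch of the configure step, the helper below prints the CMake cache option for a chosen back end and then shows the resulting command as a dry run. The function name quda_comms_flag and the ../quda source path are placeholders introduced here for illustration; only the QUDA_MPI and QUDA_QMP option names come from the text above.

```shell
# Print the CMake cache option that selects a communications back end.
# quda_comms_flag is a hypothetical helper, not part of QUDA.
quda_comms_flag() {
  case "$1" in
    mpi) echo "-DQUDA_MPI=ON" ;;
    qmp) echo "-DQUDA_QMP=ON" ;;   # also required if QUDA is to use QIO directly
    *)   echo "usage: quda_comms_flag mpi|qmp" >&2; return 1 ;;
  esac
}

# Dry run: show the configure command instead of executing it.
# ../quda is a placeholder for the QUDA source directory.
echo "cmake $(quda_comms_flag mpi) ../quda"
```

Printing the command rather than running it keeps the sketch safe to try on a machine without a QUDA checkout.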

Compiling QMP and QIO

QMP compilation instructions can be found here. Once QMP is built, ensure that the file qmp/bin/qmp-config is readable and executable in your environment: chmod +xr $PATH_TO_QMP_INSTALLATION/bin/qmp-config.

QIO compilation requires c-lime as a dependency. This is most easily obtained using recursive cloning: git clone --recursive git@github.com:usqcd-software/qio.git. Navigate to qio/ and execute the command autoreconf -f -i. You can now run configure with your preferred options, then make and finally make install.

Running

Running on multiple GPUs is similar to running any other MPI application. In general, one process will be assigned to each GPU, unless one is running with CUDA MPS enabled. It is important to make sure that all of QUDA's environment variables are propagated to all processes: these variables control some of QUDA's internal control flow, so each process should see the same settings. Perhaps the easiest way to do this is to use a run script as given below. Alternatively, one can have the job launcher broadcast the environment variables, e.g., with OpenMPI's mpirun using -x QUDA_RESOURCE_PATH=/path/to/somewhere.
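One way to avoid forgetting a variable is to generate the -x options from the current environment. The sketch below assumes OpenMPI's mpirun, where -x NAME forwards the existing value of NAME; quda_export_flags is a hypothetical helper name, and the process count and executable in the printed line are placeholders.

```shell
# Emit one "-x NAME" option per QUDA_* variable set in the current environment.
# quda_export_flags is a hypothetical helper, not part of QUDA or OpenMPI.
quda_export_flags() {
  env | sed -n 's/^\(QUDA_[A-Za-z0-9_]*\)=.*/-x \1/p' | tr '\n' ' '
}

export QUDA_RESOURCE_PATH=/path/to/somewhere
export QUDA_ENABLE_P2P=0

# Dry run: print the launch line rather than starting a job.
echo "mpirun -np 2 $(quda_export_flags)./dslash_test"
```

Other launchers have their own mechanisms for forwarding the environment, so the -x form should be treated as specific to OpenMPI.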

Running QUDA's tests

When running QUDA through a host application, the host application is typically responsible for setting the process topology and local problem size. For QUDA's internal tests, these are set using the following command-line parameters:

--dim x y z t           # x y z t is the local (per process) problem size
--gridsize X Y Z T      # X Y Z T is the process topology
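Because --dim is the per-process size, the global lattice and the number of MPI processes follow from multiplying it through by --gridsize. The following sketch does that arithmetic; lattice_summary is a hypothetical helper name introduced here, not a QUDA tool.

```shell
# Given the --dim values (local size) and the --gridsize values (process
# topology), print the number of processes to launch and the global lattice.
# lattice_summary is a hypothetical helper, not part of QUDA.
lattice_summary() {
  # $1..$4 = local x y z t, $5..$8 = grid X Y Z T
  nproc=$(( $5 * $6 * $7 * $8 ))
  echo "processes: $nproc"
  echo "global: $(( $1 * $5 )) $(( $2 * $6 )) $(( $3 * $7 )) $(( $4 * $8 ))"
}

lattice_summary 24 24 24 24  2 2 4 8
# processes: 128
# global: 48 48 96 192
```

The process count is what must be passed to the job launcher (e.g., -np for mpirun).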

Multi-GPU emulation

To aid performance modelling and debugging, it is possible to switch on communication in a given dimension, even if in actuality that dimension is local to a given GPU. The command line flag --partition N facilitates this feature, where N is a 4-bit number, with bits 0,1,2,3 used to switch on/off communication in dimensions x,y,z,t (respectively). For example:

dslash_test --partition 1     ## enable x dimension communication
dslash_test --partition 6     ## enable y and z dimension communication
dslash_test --partition 15    ## enable full communication
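The value passed to --partition can be assembled from dimension names rather than worked out by hand. This is a small sketch of that bit arithmetic; partition_mask is a hypothetical helper name.

```shell
# Build the --partition value from a list of dimensions (x, y, z, t).
# Bits 0,1,2,3 correspond to x,y,z,t. partition_mask is a hypothetical helper.
partition_mask() {
  mask=0
  for d in "$@"; do
    case "$d" in
      x) bit=0 ;;
      y) bit=1 ;;
      z) bit=2 ;;
      t) bit=3 ;;
      *) echo "unknown dimension: $d" >&2; return 1 ;;
    esac
    mask=$(( mask | (1 << bit) ))
  done
  echo "$mask"
}

partition_mask x         # 1
partition_mask y z       # 6
partition_mask x y z t   # 15
```

It could then be used as dslash_test --partition "$(partition_mask y z)".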

Peer-to-peer communication

QUDA will automatically detect multiple GPUs in the same node and use direct peer-to-peer communication where available. For GPUs to be peer-to-peer capable, they need to be either on the same PCIe root complex (e.g., connected to the same CPU socket or PCIe switch) or directly connected with NVLink. While peer-to-peer communication gives much better performance than leaving MPI to handle the inter-GPU communication, it can be useful for benchmarking and/or debugging to disable it. This can be done by setting the environment variable QUDA_ENABLE_P2P=0.

GPU Direct RDMA and CUDA-aware MPI

QUDA can optionally support GPU-aware MPI and GPU Direct RDMA (GDR), i.e., where data is passed directly to MPI without first copying it to the host, or conversely data is received directly into GPU memory. By default this option is disabled, since passing a GPU pointer to an MPI library that is unaware of GPUs will lead to undefined behaviour (most likely a segmentation fault). It can be enabled by setting the environment variable QUDA_ENABLE_GDR=1. When doing so, you should ensure that the MPI library you are using is also GPU enabled and that the network drivers support it. For Mellanox InfiniBand, this means OFED v2.1 or v3.1 and later (depending on which IB card is in your system), as well as installation of the NVIDIA peer memory driver. Details for Mellanox can be found here.

On systems that do not support GDR but are running a CUDA-aware MPI library (e.g., OpenMPI or MVAPICH2), the MPI library can automatically stage the MPI buffers in CPU memory if provided with a GPU pointer. Typically, letting the MPI library take care of this staging is slower than having QUDA do it, since it introduces unnecessary synchronization. However, we note that on systems that do not have a NIC available, enabling GDR support and using this in combination with GPU-aware MPI can be beneficial for debugging, if not performance.

It should be noted that enabling GDR will never make the performance worse, since the dslash policy autotuner will automatically test all enabled policies, e.g., basic, GDR-enabled, etc., and pick the best one for each given precision, volume, etc. Details on the dslash policy tuning are given below.

OpenMPI

Instructions for building CUDA-aware OpenMPI can be found here. Instructions for running CUDA-aware OpenMPI can be found here.

Below we give an example run script of how to use OpenMPI with GDR support and instructions for optimal process placement.

MVAPICH2

Instructions for running the current GDR-enabled MVAPICH2 can be found here. MVAPICH2-GDR is only available as a binary, but the source code for the regular CUDA-aware MVAPICH2 (with host message staging) is available.

To enable CUDA-awareness in MVAPICH2, if building from source you must set --enable-cuda when running configure. When running, you must set the environment variable MV2_USE_CUDA=1. Specific GDR-related instructions can be found here. One critical environment variable that should be set is MV2_GPUDIRECT_LIMIT=1000000000 which will force enable RDMA for all message exchange.

Cray MPI

To enable GPU-awareness in Cray's MPI you need to set the environment variable MPICH_RDMA_ENABLED_CUDA=1. At present Cray's implementation provides no user control over which messages will be exchanged using RDMA versus using host staging. This means that MPI exchange can go through the CPU memory with no means to force enable RDMA. The end result is that only very small volumes will utilize RDMA on Cray's XC platform, which most likely means only coarse grids with multigrid.

We note that performance on Cray systems may be improved by setting MPICH_NEMESIS_ASYNC_PROGRESS=1, which results in the MPI library spawning threads to ensure the forward progress of asynchronous MPI calls (which QUDA utilizes).

Maximizing GDR performance

On systems with multiple GPUs and multiple NICs, care must be taken to ensure that each GPU communicates with its closest NIC in order to maximize GPU-NIC throughput. This can be done by querying the topology of the machine you are running on, and then instrumenting your MPI and / or run script to ensure correct placement.

For example, consider running on DGX-1, a system with 4x EDR NICs and 8x P100 GPUs. Each pair of GPUs shares a NIC, so we need to ensure that the NIC local to each pair is used for all non-peer-to-peer communication.

First of all, we query the node topology with nvidia-smi topo -m:

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx5_0	mlx5_2	mlx5_1	mlx5_3	CPU Affinity
GPU0	 X 	NV1	NV1	NV1	NV1	SOC	SOC	SOC	PIX	SOC	PHB	SOC	0-19
GPU1	NV1	 X 	NV1	NV1	SOC	NV1	SOC	SOC	PIX	SOC	PHB	SOC	0-19
GPU2	NV1	NV1	 X 	NV1	SOC	SOC	NV1	SOC	PHB	SOC	PIX	SOC	0-19
GPU3	NV1	NV1	NV1	 X 	SOC	SOC	SOC	NV1	PHB	SOC	PIX	SOC	0-19
GPU4	NV1	SOC	SOC	SOC	 X 	NV1	NV1	NV1	SOC	PIX	SOC	PHB	20-39
GPU5	SOC	NV1	SOC	SOC	NV1	 X 	NV1	NV1	SOC	PIX	SOC	PHB	20-39
GPU6	SOC	SOC	NV1	SOC	NV1	NV1	 X 	NV1	SOC	PHB	SOC	PIX	20-39
GPU7	SOC	SOC	SOC	NV1	NV1	NV1	NV1	 X 	SOC	PHB	SOC	PIX	20-39
mlx5_0	PIX	PIX	PHB	PHB	SOC	SOC	SOC	SOC	 X 	SOC	PHB	SOC	
mlx5_2	SOC	SOC	SOC	SOC	PIX	PIX	PHB	PHB	SOC	 X 	SOC	PHB	
mlx5_1	PHB	PHB	PIX	PIX	SOC	SOC	SOC	SOC	PHB	SOC	 X 	SOC	
mlx5_3	SOC	SOC	SOC	SOC	PHB	PHB	PIX	PIX	SOC	PHB	SOC	 X 	

Legend:

  X   = Self
  SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

We see that there are eight GPUs and four NICs as expected. The critical point is that GPU0 and GPU1 are both connected directly to mlx5_0 on the same PCIe switch, with GPU2 and GPU3 on mlx5_1, etc. So when launching our job on multiple nodes we need to ensure that processes mapped to these GPUs are instructed to use these NICs.
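Reading the closest NIC off the matrix by eye is error-prone on larger nodes, so it can help to extract the PIX entries mechanically. The sketch below assumes the tab-separated layout shown above and NIC names beginning with "mlx"; closest_nics is a hypothetical helper name, and real nvidia-smi output may differ between driver versions, so treat it as a starting point. It is demonstrated on a reduced copy of the matrix (the GPU0 and GPU2 rows against mlx5_0 and mlx5_1).

```shell
# Print "GPU NIC" pairs for every NIC that shares a single PCIe switch (PIX)
# with a GPU. Assumes tab-separated `nvidia-smi topo -m` output and NIC names
# beginning with "mlx". closest_nics is a hypothetical helper.
closest_nics() {
  awk -F '\t' '
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }  # header row
    $1 ~ /^GPU/ {
      for (i = 2; i <= NF; i++)
        if (name[i] ~ /^mlx/ && $i == "PIX") print $1, name[i]
    }'
}

# Reduced copy of the matrix above: two GPUs, two NICs.
printf '\tGPU0\tGPU2\tmlx5_0\tmlx5_1\nGPU0\t X \tNV1\tPIX\tPHB\nGPU2\tNV1\t X \tPHB\tPIX\n' \
  | closest_nics
# GPU0 mlx5_0
# GPU2 mlx5_1
```

On a real node the same function would be fed from nvidia-smi topo -m | closest_nics.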

The script below (for OpenMPI) achieves that. To use this script with QUDA's dslash_test, running on 16 nodes of DGX-1, it would be launched with something like

mpirun 
 -np 128                                                                    # total number of processes
 -npernode 8                                                                # number of processes per node
 --bind-to none                                                             # lets the user overrule binding using numactl
 -hostfile ./hostfile                                                       # list of hosts we want to run on
 --mca btl sm,self,openib                                                   # enable intra-node, loop back to self, and IB
 --mca btl_openib_want_cuda_gdr 1                                           # enable GDR for MPI
 --mca btl_openib_cuda_rdma_limit 1000000000                                # set the largest message size for GDR
 -x EXE="./dslash_test"                                                     # executable
 -x ARGS="--gridsize 2 2 4 8 --dim 24 24 24 24 --prec double --niter 10000" # executable run-time options
 ./run.sh                                                                   # name of the below script

In the run.sh script we set the order in which CUDA devices are mapped to the local MPI ranks (the REORDER variable). Given this order, we ensure that the closest NIC is assigned to each process, and furthermore we set the CPU cores available to each process to obtain the correct non-overlapping NUMA mapping.

#!/bin/bash                                                                                                                                                                                                                                                

# QUDA specific-environment variables                                                                                                                                                                                                                      

# set the QUDA tunecache path                                                                                                                                                                                                                              
export QUDA_RESOURCE_PATH=.

# enable GDR support                                                                                                                                                                                                                                       
export QUDA_ENABLE_GDR=1

# this is the list of GPUs we have                                                                                                                                                                                                                         
GPUS=(0 1 2 3 4 5 6 7)

# This is the list of NICs we should use for each GPU                                                                                                                                                                                                      
# e.g., associate GPU0,1 with MLX0, GPU2,3 with MLX1, GPU4,5 with MLX2 and GPU6,7 with MLX3                                                                                                                                                                
NICS=(mlx5_0 mlx5_0 mlx5_1 mlx5_1 mlx5_2 mlx5_2 mlx5_3 mlx5_3)

# This is the list of CPU cores we should use for each GPU                                                                                                                                                                                                 
# e.g., 2x20 core CPUs split into 4 threads per process with correct NUMA assignment                                                                                                                                                                       
CPUS=(1-4 5-8 10-13 15-18 21-24 25-28 30-33 35-38)

# Number of physical CPU cores per GPU                                                                                                                                                                                                                     
export OMP_NUM_THREADS=4

# this is the order we want the GPUs to be assigned in (e.g. for NVLink connectivity)                                                                                                                                                                      
REORDER=(0 1 2 3 4 5 6 7)

# now given the REORDER array, we set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER for this mapping
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]},${GPUS[${REORDER[4]}]},${GPUS[${REORDER[5]}]},${GPUS[${REORDER[6]}]},${GPUS[${REORDER[7]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]} ${NICS[${REORDER[4]}]} ${NICS[${REORDER[5]}]} ${NICS[${REORDER[6]}]} ${NICS[${REORDER[7]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]} ${CPUS[${REORDER[4]}]} ${CPUS[${REORDER[5]}]} ${CPUS[${REORDER[6]}]} ${CPUS[${REORDER[7]}]})

APP="$EXE $ARGS"

lrank=$OMPI_COMM_WORLD_LOCAL_RANK

export OMPI_MCA_btl_openib_if_include=${NIC_REORDER[lrank]}
numactl --physcpubind=${CPU_REORDER[$lrank]} \
        $APP

In the example above, the REORDER variable sets the order in which the GPUs are mapped to the local MPI processes. Here we have used only the default ordering, i.e., REORDER=(0 1 2 3 4 5 6 7), which would produce an optimal mapping for a local 1x2x2x2 process topology (e.g., given the NVLink topology of DGX-1, GPU 0 can communicate with GPUs 1, 2 and 4, which are the only GPUs needed for this 3-d topology). However, if we were running with a 1x1x2x4 local process topology (given that the default MPI process topology is ((px*Ny + py)*Nz + pz)*Nt + pt), then GPU 0 would need to be able to communicate with GPUs 1 and 5, but only has connections to GPUs 0, 2, 3, and 5.** So in this case, we would want to use REORDER=(0 1 2 3 6 7 4 5), which would provide the optimal peer-to-peer connectivity matrix.

** This is the default for QUDA and MILC. BQCD, on the other hand, uses the inverse of this mapping, ((pt*Nz + pz)*Ny + py)*Nx + px. In this case, the BQCD mapping would actually provide the optimal peer-to-peer mapping with the default GPU order.
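To see which local ranks a given process exchanges halos with, the default rank formula can be evaluated directly. The sketch below does this for the 1x2x2x2 case; rank_of and neighbors are hypothetical helper names, and treating the node-local topology as periodic is a simplifying assumption made here (in a multi-node job the wrap-around partner in a dimension split across nodes lives on another node).

```shell
# Local process topology (Nx Ny Nz Nt); here the 1x2x2x2 case from the text.
Nx=1; Ny=2; Nz=2; Nt=2

# Lexicographic rank with t running fastest: ((px*Ny + py)*Nz + pz)*Nt + pt
rank_of() {
  echo $(( (($1 * Ny + $2) * Nz + $3) * Nt + $4 ))
}

# Print the distinct ranks that are nearest neighbors of (px py pz pt),
# assuming periodic wrap-around within the local topology (a simplification).
neighbors() {
  {
    rank_of $(( ($1 + 1) % Nx )) "$2" "$3" "$4"
    rank_of $(( ($1 + Nx - 1) % Nx )) "$2" "$3" "$4"
    rank_of "$1" $(( ($2 + 1) % Ny )) "$3" "$4"
    rank_of "$1" $(( ($2 + Ny - 1) % Ny )) "$3" "$4"
    rank_of "$1" "$2" $(( ($3 + 1) % Nz )) "$4"
    rank_of "$1" "$2" $(( ($3 + Nz - 1) % Nz )) "$4"
    rank_of "$1" "$2" "$3" $(( ($4 + 1) % Nt ))
    rank_of "$1" "$2" "$3" $(( ($4 + Nt - 1) % Nt ))
  } | sort -n -u | grep -v -x "$(rank_of "$@")" | tr '\n' ' '
}

neighbors 0 0 0 0   # 1 2 4
```

Comparing this list against the NV# entries in the nvidia-smi topo -m row for the corresponding GPU shows whether every neighbor is reachable over NVLink for a candidate REORDER.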

Asymmetric Topologies

On asymmetric systems where some GPUs are on one side of the QPI bus and the NIC is on the other, care must be taken since the QPI bus cannot efficiently forward the memory traffic between attached PCIe devices. For example, the following system has four GPUs and a NIC on one socket, with two GPUs and no NIC on the other socket. This system will only give efficient GDR support for the first four GPUs, with the other two needing to stage their inter-node memory traffic explicitly through CPU memory.

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	mlx5_0	CPU Affinity
GPU0	 X 	PIX	PHB	PHB	SOC	SOC	PHB	0-9
GPU1	PIX	 X 	PHB	PHB	SOC	SOC	PHB	0-9
GPU2	PHB	PHB	 X 	PIX	SOC	SOC	PHB	0-9
GPU3	PHB	PHB	PIX	 X 	SOC	SOC	PHB	0-9
GPU4	SOC	SOC	SOC	SOC	 X 	PHB	SOC	10-19
GPU5	SOC	SOC	SOC	SOC	PHB	 X 	SOC	10-19
mlx5_0	PHB	PHB	PHB	PHB	SOC	SOC	 X

To enable such a setup, the environment variable QUDA_ENABLE_GDR_BLACKLIST can be used to exclude a given set of GPUs from using GDR; these GPUs will instead fall back to explicit staging through CPU memory. Below is an example of how to do this for the above topology using OpenMPI.

mpirun
 -np 48                                                                     # total number of processes
 -npernode 6                                                                # number of processes per node
 --bind-to none                                                             # lets the user overrule binding using numactl
 -hostfile ./hostfile                                                       # list of hosts we want to run on
 --mca btl sm,self,openib                                                   # enable intra-node, loop back to self, and IB
 --mca btl_openib_want_cuda_gdr 1                                           # enable GDR for MPI
 --mca btl_openib_cuda_rdma_limit 1000000000                                # set the largest message size for GDR
 -x EXE="./dslash_test"                                                     # executable
 -x ARGS="--gridsize 2 2 2 6 --dim 24 24 24 24 --prec double --niter 10000" # executable run-time options
 ./run.sh

where run.sh would be as given below

#!/bin/bash

# QUDA specific-environment variables

# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.

# enable GDR support
export QUDA_ENABLE_GDR=1

# exclude GPUs 4 and 5 from GDR since it's across QPI
export QUDA_ENABLE_GDR_BLACKLIST="4,5"

# this is the list of GPUs we have
GPUS=(0 1 2 3 4 5)

# This is the list of NICs we should use for each GPU
NICS=(mlx5_0 mlx5_0 mlx5_0 mlx5_0 mlx5_0 mlx5_0)

# This is the list of CPU cores we should use for each GPU
# e.g., 2x10 core CPUs split into 2 threads per process with correct NUMA assignment
CPUS=(1-2 3-4 5-6 7-8 10-11 15-16)

# Number of physical CPU cores per GPU
export OMP_NUM_THREADS=2

# this is the order we want the GPUs to be assigned in (e.g. for NVLink connectivity)
REORDER=(0 1 2 3 4 5)

# now given the REORDER array, we set CUDA_VISIBLE_DEVICES, NIC_REORDER and CPU_REORDER for this mapping
export CUDA_VISIBLE_DEVICES="${GPUS[${REORDER[0]}]},${GPUS[${REORDER[1]}]},${GPUS[${REORDER[2]}]},${GPUS[${REORDER[3]}]},${GPUS[${REORDER[4]}]},${GPUS[${REORDER[5]}]}"
NIC_REORDER=(${NICS[${REORDER[0]}]} ${NICS[${REORDER[1]}]} ${NICS[${REORDER[2]}]} ${NICS[${REORDER[3]}]} ${NICS[${REORDER[4]}]} ${NICS[${REORDER[5]}]})
CPU_REORDER=(${CPUS[${REORDER[0]}]} ${CPUS[${REORDER[1]}]} ${CPUS[${REORDER[2]}]} ${CPUS[${REORDER[3]}]} ${CPUS[${REORDER[4]}]} ${CPUS[${REORDER[5]}]})

APP="$EXE $ARGS"

lrank=$OMPI_COMM_WORLD_LOCAL_RANK

export OMPI_MCA_btl_openib_if_include=${NIC_REORDER[lrank]}
numactl --physcpubind=${CPU_REORDER[$lrank]} \
        $APP
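The run scripts above expand the REORDER permutation element by element, which has to be rewritten whenever the GPU count changes. As a sketch of an alternative, the same mapping can be computed in a loop. This version uses space-separated lists and a small nth helper (a hypothetical name) so that it also runs under a plain POSIX shell; the values are those of the DGX-1 example with the non-default REORDER, and the local rank defaults to 0 when the script is not launched under OpenMPI.

```shell
# nth INDEX WORD...: print the INDEX-th word (0-based). Hypothetical helper.
nth() {
  shift "$(( $1 + 1 ))"
  printf '%s\n' "$1"
}

GPUS="0 1 2 3 4 5 6 7"
NICS="mlx5_0 mlx5_0 mlx5_1 mlx5_1 mlx5_2 mlx5_2 mlx5_3 mlx5_3"
CPUS="1-4 5-8 10-13 15-18 21-24 25-28 30-33 35-38"
REORDER="0 1 2 3 6 7 4 5"

# Build the comma-separated device list in REORDER order.
visible=""
for i in $REORDER; do
  visible="${visible:+$visible,}$(nth "$i" $GPUS)"
done
export CUDA_VISIBLE_DEVICES="$visible"

# Pick the NIC and cores for this process; default to local rank 0 when
# OMPI_COMM_WORLD_LOCAL_RANK is not set.
lrank="${OMPI_COMM_WORLD_LOCAL_RANK:-0}"
slot=$(nth "$lrank" $REORDER)
nic=$(nth "$slot" $NICS)
cores=$(nth "$slot" $CPUS)
echo "local rank $lrank: devices=$CUDA_VISIBLE_DEVICES nic=$nic cores=$cores"
```

In a real run script the last line would be replaced by exporting OMPI_MCA_btl_openib_if_include="$nic" and launching the application under numactl --physcpubind="$cores", as in the scripts above.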

Dependence on CUDA_DEVICE_MAX_CONNECTIONS

The environment variable CUDA_DEVICE_MAX_CONNECTIONS controls how many hardware channels the GPU should use, i.e., how much work can be launched independently from different streams without any false dependencies. QUDA gets optimum performance with CUDA_DEVICE_MAX_CONNECTIONS=1, since this gives the lowest latency, and in general you still get overlap between kernel launches and memory copies due to the order in which these are issued. So in general, the advice is to set this parameter equal to 1, which will give optimal scaling.

Low-level Details

Dslash Policy Tuning

Since the optimum strategy for overlapping Dslash computation and communication varies depending on the machine you are running on, the size of the problem you are running, the precision, etc., QUDA implements multiple dslash execution policies and utilizes the autotuner to identify the optimal strategy for a given parameter set, then uses that policy for all subsequent invocations (dslash_policy.cuh). At present the following policies are enabled in QUDA:

QUDA_DSLASH = 0: bandwidth optimized - aim for maximum compute and comms overlap (one halo kernel per dimension)
QUDA_FUSED_DSLASH = 1: kernel latency optimized - use a single halo update kernel for all dimensions
QUDA_GPU_COMMS_DSLASH = 2: GDR-enabled variant of QUDA_DSLASH
QUDA_FUSED_GPU_COMMS_DSLASH = 3: GDR-enabled variant of QUDA_FUSED_DSLASH
QUDA_ZERO_COPY_DSLASH_PACK = 4: write the non-p2p packed halo buffers directly to CPU memory for minimum MPI_Send latency
QUDA_FUSED_ZERO_COPY_DSLASH_PACK = 5: write non-p2p packed halo buffers directly to CPU memory and use fused halo kernel
QUDA_ZERO_COPY_DSLASH = 6: write non-p2p halo buffer directly to CPU memory and read halos directly from CPU memory in halo update kernels (experimental, not yet enabled)
QUDA_FUSED_ZERO_COPY_DSLASH = 7: write non-p2p halo buffer directly to CPU memory and read halos directly from CPU memory in a single halo update kernel (experimental, not yet enabled)
QUDA_DSLASH_ASYNC = 8: Experimental, not enabled
QUDA_FUSED_DSLASH_ASYNC = 9: Experimental, not enabled
QUDA_PTHREADS_DSLASH = 10: Experimental, not enabled
QUDA_DSLASH_NC = 11: For dslash kernels with no communication, e.g., Dslash5 dwf operator. Not used by policy tuner.

Note that the GDR-enabled policies (2 and 3 above) are only enabled if QUDA_ENABLE_GDR is set. In most instances, you will just want to let the autotuner pick the best policy for your parameter set. However, you can restrict the set of policies that are tuned over by setting the environment variable QUDA_ENABLE_DSLASH_POLICY, e.g., setting QUDA_ENABLE_DSLASH_POLICY=1,3,5 would restrict the policy tuning to the "fused" variants only.
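Since QUDA_ENABLE_DSLASH_POLICY takes numeric values, a small lookup can make run scripts easier to read. The sketch below maps the policy names from the list above to their numbers; policy_list is a hypothetical helper name, and only the policies described above as usable by the tuner (0 to 5) are included.

```shell
# Map policy names from the list above to their numeric values and join
# them with commas. policy_list is a hypothetical helper, not part of QUDA.
policy_list() {
  list=""
  for p in "$@"; do
    case "$p" in
      QUDA_DSLASH)                      n=0 ;;
      QUDA_FUSED_DSLASH)                n=1 ;;
      QUDA_GPU_COMMS_DSLASH)            n=2 ;;
      QUDA_FUSED_GPU_COMMS_DSLASH)      n=3 ;;
      QUDA_ZERO_COPY_DSLASH_PACK)       n=4 ;;
      QUDA_FUSED_ZERO_COPY_DSLASH_PACK) n=5 ;;
      *) echo "policy not enabled or unknown: $p" >&2; return 1 ;;
    esac
    list="${list:+$list,}$n"
  done
  echo "$list"
}

# Restrict tuning to the "fused" variants.
export QUDA_ENABLE_DSLASH_POLICY="$(policy_list QUDA_FUSED_DSLASH \
  QUDA_FUSED_GPU_COMMS_DSLASH QUDA_FUSED_ZERO_COPY_DSLASH_PACK)"
echo "$QUDA_ENABLE_DSLASH_POLICY"   # 1,3,5
```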

By default all policies will use peer-to-peer communication if available. To disable peer-to-peer, you set QUDA_ENABLE_P2P=0. Finally we note that if between a tuned run and a subsequent run, either the QUDA_ENABLE_P2P or QUDA_ENABLE_GDR environment variables change, then the autotuner will exit, forcing a retune.

Dslash Component Benchmarking

In order to benchmark the components of the Dslash in isolation, QUDA can selectively disable portions of the Dslash computation. This is useful, for example, for benchmarking NIC performance, or for testing kernel performance in the absence of communication. The dslash computation is broken down into multiple steps:

packing: prepare contiguous halo buffers to be handed off to MPI / P2P communication
comms: p2p cudaMemcpy within the node and MPI between nodes
interior: apply the dslash stencil on the interior while the halo regions are being communicated
exterior: once the comms have finished, we finish the calculation by applying the halo contributions to the boundary elements
copy: when GDR / P2P is not available between a set of GPUs, we have the additional D2H/H2D memcpys for staging the MPI buffers in CPU memory

The following set of environment variables can be used to disable the various parts of the computation and/or the communication. All of the variables below default to 1 (i.e., do the full calculation), but each can be disabled by setting it to 0 (in which case the result will, of course, be wrong).

QUDA_ENABLE_DSLASH_PACK - enable/disable initial packing kernel
QUDA_ENABLE_DSLASH_COMMS - enable/disable P2P memcpys and/or MPI exchange
QUDA_ENABLE_DSLASH_INTERIOR - enable/disable interior kernel computation
QUDA_ENABLE_DSLASH_EXTERIOR - enable/disable exterior kernel computation
QUDA_ENABLE_DSLASH_COPY - enable/disable host staging copies for MPI if GDR/P2P not enabled

By combining the explicit policy choice with the above variables, we can benchmark in isolation any computation or communication pattern.
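As a sketch of how these switches might be driven from a run script, the helper below turns every component off and then re-enables only the ones named. dslash_only is a hypothetical helper name; the choice of which components make up a "communication-only" run is an assumption made here for illustration.

```shell
# Enable only the listed dslash components and disable the rest.
# dslash_only is a hypothetical helper; the variable names are those above.
dslash_only() {
  for c in PACK COMMS INTERIOR EXTERIOR COPY; do
    export "QUDA_ENABLE_DSLASH_$c=0"
  done
  for c in "$@"; do
    export "QUDA_ENABLE_DSLASH_$c=1"
  done
}

# Communication-only run: keep packing, comms and host staging copies,
# skip the interior and exterior kernels (results will be wrong by design).
dslash_only PACK COMMS COPY
env | grep '^QUDA_ENABLE_DSLASH_' | sort
```

Because the helper exports the variables, they still need to be propagated to every MPI process by the launcher or run script.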

For communication benchmarking, the dslash_test and staggered_dslash_test programs will report the effective bi-directional bandwidth sustained by the algorithm (just grep the output for “bi”). See the below results for an example taken between two GPUs connected using PCIe peer-to-peer. With the full computation enabled we are unable to see what the actual achieved bi-directional bandwidth is, i.e., it plateaus once all the communications are hidden by the local computation, but when only the communications are performed we see the expected behavior, with the bi-directional bandwidth saturating at 19 GB/s. We can also see that the bandwidth does not saturate until a relatively large local volume: this is the motivation for future work using a SHMEM-style programming model where all peer-to-peer reads and writes will be done by reading and writing directly to neighboring GPUs, which has significantly lower latency.

QUDA calls
