MILC with QUDA
These instructions are intended to be a quick start guide to getting MILC running with GPUs using the QUDA library.
You can obtain QUDA using the following:
git clone https://github.com/lattice/quda.git
At the time of writing, the current stable release of QUDA is the 1.0 pre-release, located in the release/1.0.x branch. Further improvements are merged into the develop branch (which is the git default), so if you want to live on the bleeding edge you can use that branch; otherwise use the release/1.0.x branch.
cd quda
git checkout release/1.0.x
cd ..
QUDA uses cmake to set compilation options. For running with HISQ fermions, e.g., the su3_rhmc_hisq test that is commonly used in MILC, you should do something like the following:
mkdir build
cd build
cmake ../quda -DQUDA_GPU_ARCH=sm_70 -DQUDA_DIRAC_WILSON=OFF -DQUDA_DIRAC_CLOVER=OFF -DQUDA_DIRAC_TWISTED_MASS=OFF -DQUDA_DIRAC_TWISTED_CLOVER=OFF -DQUDA_DIRAC_NDEG_TWISTED_MASS=OFF -DQUDA_DIRAC_DOMAIN_WALL=OFF -DQUDA_LINK_HISQ=ON -DQUDA_FORCE_HISQ=ON -DQUDA_FORCE_GAUGE=ON -DQUDA_BUILD_SHAREDLIB=ON -DQUDA_QMP=ON -DQUDA_QIO=ON -DQUDA_DOWNLOAD_USQCD=ON
cmake .
The final cmake command above is often required to ensure that cmake completes its configuration fully; this is a known issue. Above, we implicitly assume that the CUDA and MPI compilers are present in the $PATH. Here we are setting the GPU architecture to sm_70, which corresponds to Volta. Choices include:
- sm_35 for Kepler (Tesla K20 / K40 / K80)
- sm_52 for Maxwell (Tesla M40 / Quadro M6000)
- sm_60 for Pascal (Tesla P100 / Quadro GP100)
- sm_70 for Volta (Tesla V100 / Quadro GV100)
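The architecture choices listed above can be captured in a small lookup, which is convenient when one build script serves several machines. This is only a sketch: the function name quda_gpu_arch is hypothetical and is not part of QUDA, and it covers only the four generations listed here.

```shell
# Hypothetical helper (not part of QUDA): map a GPU generation name to the
# QUDA_GPU_ARCH value from the list above.
quda_gpu_arch() {
    case "$1" in
        kepler)  echo sm_35 ;;
        maxwell) echo sm_52 ;;
        pascal)  echo sm_60 ;;
        volta)   echo sm_70 ;;
        *) echo "unknown GPU generation: $1" >&2; return 1 ;;
    esac
}

# Print the cmake option for a Volta system.
echo "-DQUDA_GPU_ARCH=$(quda_gpu_arch volta)"
```

The printed option can then be pasted into, or substituted for, the corresponding argument of the cmake command.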
Here we are disabling unnecessary parts of QUDA when used with MILC, in order to reduce compilation time. The final three arguments concern the installation of the USQCD companion libraries QMP and QIO. QUDA can automate their download and installation, and that is what we have enabled here.
Finally, build QUDA. Use a parallel build, since QUDA can take a long time to compile:
make -j N
where N is the number of cores or threads available on the compilation node. We typically recommend setting this to the number of hardware threads (e.g., hyperthreads) in the system. If you set an install path when running cmake, complete the installation with:
make install
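One way to choose N for the parallel build without hard-coding it is to ask the system. The sketch below assumes a Linux node where the nproc utility from GNU coreutils is available, and falls back to a serial build otherwise; the function name build_jobs is hypothetical.

```shell
# Hypothetical helper: number of parallel make jobs to use.
# nproc (GNU coreutils) reports the processing units available to this
# process, which includes hyperthreads; fall back to 1 if it is missing.
build_jobs() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        echo 1
    fi
}

echo "building with $(build_jobs) parallel jobs"
# Run from the QUDA build directory:
# make -j "$(build_jobs)"
```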
For use with QUDA we recommend the present develop branch of MILC. This enables the maximum benefit of QUDA acceleration.
git clone https://github.com/milc-qcd/milc_qcd.git
cd milc_qcd
git checkout develop
To aid compilation of MILC with QUDA support, a helper script for the su3_rhmd_hisq application is provided at ks_imp_rhmc/compile_su3_rhmd_hisq_quda.sh. Editing this script as appropriate and executing it from its directory should result in a full build of MILC with QUDA acceleration for the desired application. For a standard build the important settings are CUDA_HOME, QUDA_HOME, QIOPAR and QMPPAR. The script is easy to modify to build a different executable, e.g., by replacing the su3_rhmd_hisq executable name in the script with the desired one. Note that MILC must be pointed to the QMP and QIO libraries that were installed as part of the QUDA installation; these are located in the usqcd directory inside the QUDA build directory.
cd ks_imp_rhmc
cp ../Makefile .
./compile_su3_rhmd_hisq_quda.sh
The build of MILC should now be complete.
Typically, running MILC with QUDA is exactly like running MILC without QUDA. There is a one-to-one mapping between the number of GPUs and the number of MPI processes in the system. If you have followed the above instructions to build QUDA, then QUDA will have been built as a shared library, and so you will need to include the path to the QUDA library in the LD_LIBRARY_PATH.
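A minimal sketch of the library path setup follows. It assumes that QUDA_HOME points at the QUDA build (or install) directory and that the shared library sits in its lib subdirectory; both are assumptions about your layout, so adjust the path if your library lives elsewhere.

```shell
# Assumption: QUDA_HOME is the QUDA build or install directory and the
# shared library is under its lib/ subdirectory.
QUDA_HOME="${QUDA_HOME:-$PWD/build}"

# Prepend the QUDA library directory, keeping any existing entries.
LD_LIBRARY_PATH="${QUDA_HOME}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
export LD_LIBRARY_PATH

echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"
```

The `${LD_LIBRARY_PATH:+:...}` form avoids leaving a trailing colon when the variable was previously empty.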
Typically, the CUDA Multi-Process Service (MPS) should not be enabled, as it will only decrease performance. A possible exception is a system with many CPU cores on which MPI performance is superior to OpenMP performance. Otherwise just set OMP_NUM_THREADS (or equivalent) to the number of cores available per process (per GPU).
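The cores-per-process figure is simply the number of CPU cores on a node divided by the number of GPUs on that node, given one MPI process per GPU. The sketch below illustrates this; the function name threads_per_process and the node shape (40 cores, 4 GPUs) are illustrative assumptions, not values from any particular system.

```shell
# Hypothetical helper: CPU cores available to each MPI process when one
# process drives one GPU. Arguments: cores per node, GPUs per node.
threads_per_process() {
    if [ "$#" -ne 2 ] || [ "$2" -le 0 ]; then
        echo "usage: threads_per_process CORES_PER_NODE GPUS_PER_NODE" >&2
        return 1
    fi
    # Integer division: any leftover cores are simply not used.
    echo $(( $1 / $2 ))
}

# Illustrative node: 40 CPU cores shared by 4 GPUs.
OMP_NUM_THREADS=$(threads_per_process 40 4)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```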
Set a location for QUDA to write out its autotuning cache, e.g.,
export QUDA_RESOURCE_PATH=/tmp
On the first run QUDA will dump the kernel launch parameters here, for use in later runs. Thus, to get optimum performance you should first do a tuning run and then do a benchmarking run afterwards. This path should be set to a location that is accessible by all of the nodes running the executable.
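The tune-then-benchmark workflow might be sketched as follows. The directory name quda_tunecache is a hypothetical choice, and the sketch assumes that the current working directory is on a filesystem visible to every node; the launch lines are left as comments because the remaining options depend on your job.

```shell
# Assumption: the current directory is on a filesystem shared by all nodes.
# The directory name is a hypothetical choice.
QUDA_RESOURCE_PATH="${QUDA_RESOURCE_PATH:-$PWD/quda_tunecache}"
export QUDA_RESOURCE_PATH

# Create the directory up front so it exists before the first run.
mkdir -p "$QUDA_RESOURCE_PATH"
echo "QUDA tuning cache directory: ${QUDA_RESOURCE_PATH}"

# First run: kernels are autotuned and the launch parameters are saved.
# mpirun -np 8 ./su3_rhmc_hisq ...(other options)
# Second run: the saved parameters are reused; time this one.
# mpirun -np 8 ./su3_rhmc_hisq ...(other options)
```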
For guidelines on how to improve strong scaling performance (fixed problem size as the number of GPUs is increased), refer to the quick-start and multi-gpu pages.
By default MILC will attempt to split the problem between processes in order to minimize the surface-to-volume ratio of the local problem size. This is in general a good thing to do; however, MILC favours partitioning the fastest-running X dimension rather than the slowest-running T dimension. This is bad for running on modern architectures, since it leads to strided memory accesses when doing the X-face halo update. The process grid topology can be set manually, making it easy to override this, using the command-line option
-qmp-geom MX MY MZ MT
to specify a partitioning of the X axis into MX equal segments, the Y axis into MY segments, etc. So, for example, with a lattice size of 32x32x32x64 and 8 MPI ranks, the command
mpirun -np 8 ./su3_rhmc_hisq -qmp-geom 1 1 2 4 ...(other options)
would result in local volumes of 32x32x16x16 on a 1x1x2x4 grid of virtual processors. Without this additional flag, the process topology would default to a local problem size of 16x16x32x32 (partitioning in T first since it has length 64, then splitting from the X dimension upwards), which leads to strided memory accesses.
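The local volumes quoted above follow from dividing each lattice extent by the number of processes along that axis. The sketch below reproduces that arithmetic so that a candidate -qmp-geom setting can be checked before launching a job. The function name local_volume is hypothetical and is not part of MILC or QUDA, and the 2 2 1 2 grid in the second call is inferred from the default local volume described above.

```shell
# Hypothetical helper (not part of MILC or QUDA): print the local volume for
# a global lattice NX NY NZ NT split over a process grid MX MY MZ MT.
local_volume() {
    if [ "$#" -ne 8 ]; then
        echo "usage: local_volume NX NY NZ NT MX MY MZ MT" >&2
        return 1
    fi
    _out=""
    # Pair each lattice extent with the process count along the same axis.
    for _pair in "$1:$5" "$2:$6" "$3:$7" "$4:$8"; do
        _n=${_pair%%:*}
        _m=${_pair##*:}
        # Each axis must be split into equal segments.
        if [ "$_m" -le 0 ] || [ $(( _n % _m )) -ne 0 ]; then
            echo "cannot split extent $_n into $_m equal segments" >&2
            return 1
        fi
        _out="${_out:+${_out}x}$(( _n / _m ))"
    done
    echo "$_out"
}

local_volume 32 32 32 64 1 1 2 4   # prints 32x32x16x16
local_volume 32 32 32 64 2 2 1 2   # prints 16x16x32x32
```

The product of the four process counts must equal the number of MPI ranks (here 1x1x2x4 = 8); this sketch does not check that condition.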
See also the NERSC-MILC page for further details about launching large-scale jobs.