QUDA Quick Start Guide
As of QUDA 0.8 the preferred way to build is with cmake. See Building QUDA with cmake for details. The description of the configure options below may still be useful, since the names used for the cmake configuration are similar or identical. Note that the explicit multi-GPU option is gone; a multi-GPU build is created automatically if you build QUDA with MPI or QMP.
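For orientation, a minimal out-of-source cmake build might look like the sketch below. The option names (`QUDA_GPU_ARCH`, `QUDA_MPI`, `QUDA_QMP`) are assumptions here and should be checked against the Building QUDA with cmake page for your version.

```bash
# Minimal cmake build sketch; option names are assumptions, verify them with
# "cmake -LH" or the Building QUDA with cmake wiki page.
mkdir build && cd build
cmake .. -DQUDA_GPU_ARCH=sm_60 -DQUDA_MPI=ON   # use -DQUDA_QMP=ON to communicate via QMP instead
make -j8
```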
Installation using configure (autoconf) for QUDA 0.7.x (Use as fallback only for 0.8.x)
Installing the library involves running `configure` followed by `make`. See `./configure --help` for a list of configure options. At a minimum, you'll probably want to set the GPU architecture.
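As a sketch, a minimal single-GPU build might look like the following; the exact name of the architecture flag varies between releases, so treat `--enable-gpu-arch` as an assumption and confirm it with `./configure --help`.

```bash
# Hypothetical minimal serial build targeting Kepler-class GPUs
# (the architecture flag name is an assumption; check ./configure --help).
./configure --enable-gpu-arch=sm_35
make -j8
```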
Enabling multi-GPU support requires passing the `--enable-multi-gpu` flag to configure, as well as `--with-mpi=<PATH>` and optionally `--with-qmp=<PATH>`. If the latter is given, QUDA will use QMP for communications; otherwise, MPI will be called directly. By default, it is assumed that the MPI compiler wrappers are `<MPI_PATH>/bin/mpicc` and `<MPI_PATH>/bin/mpicxx` for C and C++, respectively. These choices may be overridden by setting the CC and CXX variables on the command line as follows:
./configure --enable-multi-gpu --with-mpi=<MPI_PATH> [--with-qmp=<QMP_PATH>] [OTHER_OPTIONS] CC=my_mpicc CXX=my_mpicxx
Finally, with some MPI implementations, executables compiled against MPI will not run without `mpirun`. This has the side effect of causing the configure script to believe that the compiler is failing to produce a valid executable. To skip these checks, one can trick configure into thinking that it's cross-compiling by setting the `--build=none` and `--host=<HOST>` flags. For the latter, `--host=x86_64-linux-gnu` should work on a 64-bit Linux system.
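Putting these pieces together, a multi-GPU configure line on such a system might look like the sketch below; the Cray-style `CC=cc CXX=CC` compiler wrappers are only an illustrative assumption.

```bash
# Sketch: multi-GPU build that skips the executable-launch checks at configure time.
./configure --enable-multi-gpu --with-mpi=<MPI_PATH> \
    --build=none --host=x86_64-linux-gnu \
    CC=cc CXX=CC   # compiler wrappers shown here are an assumption; use your site's wrappers
make -j8
```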
By default only the QDP and MILC interfaces are enabled. Support for interfacing with QDPJIT, BQCD, or CPS must be enabled at configure time with the appropriate flag, e.g., `--enable-bqcd-interface`. To keep compilation time to a minimum, it is recommended to enable only those interfaces that are used by a given application. The QDP and MILC interfaces can be disabled with the corresponding flag, e.g., `--disable-milc-interface`.
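For example, a BQCD-only build might combine these flags as below; the `--disable-qdp-interface` flag is inferred by analogy with `--disable-milc-interface` and should be confirmed with `./configure --help`.

```bash
# Enable only the BQCD interface, dropping the default QDP and MILC ones.
./configure --enable-bqcd-interface \
    --disable-qdp-interface --disable-milc-interface [OTHER_OPTIONS]
```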
The eigenvector solvers (eigCG and incremental eigCG) require the installation of the MAGMA dense linear algebra package. Supported versions are MAGMA 1.5.x and 1.6.x, available from http://icl.cs.utk.edu/magma/index.html. MAGMA is enabled using the configure option `--with-magma=MAGMA_PATH`.
If Fortran interface support is desired, the F90 environment variable should be set when configure is invoked, and "make fortran" must be run explicitly, since the Fortran interface modules are not built by default.
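If you do build the Fortran interface, the sequence might look like the sketch below; the choice of `gfortran` is only an example.

```bash
# Sketch: the Fortran interface modules are not built by default,
# so "make fortran" must be run explicitly after the main build.
./configure [OTHER_OPTIONS] F90=gfortran
make
make fortran
```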
As examples, the scripts "configure.milc.titan" and "configure.chroma.titan" are provided. These configure QUDA for expected use with MILC and Chroma, respectively, on Titan (the Tesla K20X-powered Cray XK7 supercomputer at the Oak Ridge Leadership Computing Facility).
Throughout the library, auto-tuning is used to select optimal launch parameters for most performance-critical kernels. This tuning process takes some time and will generally slow things down the first time a given kernel is called during a run. To avoid this one-time overhead in subsequent runs (using the same action, solver, lattice volume, etc.), the optimal parameters are cached to disk. For this to work, the QUDA_RESOURCE_PATH environment variable must be set, pointing to a writeable directory. Note that since the tuned parameters are hardware-specific, this "resource directory" should not be shared between jobs running on different systems (e.g., two clusters with different GPUs installed). Attempting to use parameters tuned for one card on a different card may lead to unexpected errors.
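A typical setup is shown below; the directory name is arbitrary, but keeping one cache per system avoids mixing tunings from GPUs of different types.

```bash
# Cache tuned kernel launch parameters across runs on this machine.
mkdir -p $HOME/quda-tune/$(hostname)
export QUDA_RESOURCE_PATH=$HOME/quda-tune/$(hostname)
```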
QUDA includes a number of internal tests, whose primary goals are correctness and performance testing. At the time of writing, the list of tests is as follows (an example invocation is given after the list):
- `dslash_test`: Wilson, clover, twisted mass, twisted clover, domain wall, Mobius
- `staggered_dslash_test`: staggered and improved staggered
- `invert_test`: solver test for Wilson-like fermions
- `staggered_invert_test`: solver test for staggered-like fermions
- `blas_test`: tests all BLAS functions for performance and correctness
- `deflation_test`: test for the eigCG solver
- `fermion_force_test`: test for the asqtad fermion force computation (deprecated)
- `gauge_force_test`: test for the gauge force computation
- `hisq_force_paths_test`: HISQ force derivative computation test
- `hisq_unitarize_force_test`: HISQ force unitarize test
- `llfat_test`: gauge link fattening for HISQ / asqtad fermions
- `su3_test`: test of SU(3) reconstruction used in dslash_test
- `unitarize_link_test`: test of the unitarization used when constructing improved links
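Each test executable accepts command-line options controlling precision, lattice dimensions, fermion type, and so on; the exact set of flags varies between QUDA versions, so the invocation below is only a sketch and the `--help` output of each test is the authoritative reference.

```bash
# Print the supported options for a test, then run a solver test under MPI (multi-GPU builds).
./tests/dslash_test --help
mpirun -np 4 ./tests/invert_test [OPTIONS]
```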
Include the header file `include/quda.h` in your application, link against `lib/libquda.a`, and study `tests/invert_test.cpp` (for Wilson, clover, twisted-mass, or domain wall fermions) or `tests/staggered_invert_test.cpp` (for asqtad/HISQ fermions) for examples of the solver interface. The various solver options are enumerated in `include/enum_quda.h`.
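A compile-and-link line for a multi-GPU build might look like the sketch below; the paths and the exact set of additional libraries (CUDA runtime, QMP, etc.) depend on how QUDA was configured.

```bash
# Sketch: build an application against an installed QUDA (paths are placeholders).
mpicxx -I/path/to/quda/include my_app.cpp \
    -L/path/to/quda/lib -lquda \
    -L${CUDA_HOME}/lib64 -lcudart -o my_app
```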
QUDA uses runtime autotuning to maximize the performance of each kernel on a given GPU. This brings better performance portability across GPU architectures as well as across different lattice volumes, parameters, etc. The `tunecache.tsv` file is dumped at the end of a run in the location specified by the `QUDA_RESOURCE_PATH` environment variable. If this is not specified, the autotuning results are cached only within the scope of the run and are lost when the job ends.
QUDA has two specific debugging modes, HOST_DEBUG and DEVICE_DEBUG; a sketch of how they are typically enabled follows the list.
- HOST_DEBUG compiles all host code using the `-g` flag and ensures that all CUDA error reporting is done synchronously (i.e., the GPU and CPU are synchronized prior to fetching the error state). For most debugging, HOST_DEBUG is all that should be needed, since most bugs tend to be in CPU code. There is a noticeable performance impact from enabling HOST_DEBUG, at the 20-50% level, with the penalty being greater at smaller local volumes.
- DEVICE_DEBUG compiles all GPU kernels using the `-G` flag. This provides accurate line reporting in cuda-gdb and cuda-memcheck. There is a huge performance penalty from enabling this, at the 100x level.
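How these modes are switched on depends on the build system and QUDA version; the option names below are assumptions and should be verified against `./configure --help` or the cmake cache.

```bash
# Assumed option names for enabling the debug modes; verify for your version.
./configure [OTHER_OPTIONS] --enable-host-debug        # autoconf builds
cmake .. -DCMAKE_BUILD_TYPE=HOSTDEBUG                  # cmake builds; DEVICEDEBUG adds -G
```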