QUDA Quick Start Guide

Internal Tests

QUDA includes a number of internal tests, whose primary goals are correctness and performance testing. As of QUDA, the list of tests stands at:

dslash_test: Wilson, clover, twisted mass, twisted clover, domain wall, Möbius
staggered_dslash_test: staggered and improved staggered
invert_test: solver test for Wilson-like fermions
staggered_invert_test: solver test for staggered-like fermions
blas_test: test all BLAS functions for performance and correctness
deflation_test: test for eigCG solver
fermion_force_test: test for asqtad fermion force computation (deprecated)
gauge_force_test: test for gauge force computation
hisq_force_paths_test: HISQ force derivative computation test
hisq_unitarize_force_test: HISQ force unitarize test
llfat_test: Gauge link fattening for HISQ / asqtad fermions
su3_test: Test of SU(3) reconstruction used in dslash_test
unitarize_link_test: Test of unitarization used when constructing improved links

Kernel Autotuning

QUDA uses runtime autotuning to maximize the performance of each kernel on a given GPU. This brings better performance portability across GPU architectures as well as across different lattice volumes, parameters, etc. The tunecache.tsv file is written at the end of the run to the location specified by the QUDA_RESOURCE_PATH environment variable. If this variable is not set, the autotuning results are cached only for the duration of the run and are lost when the job ends.
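As a minimal sketch, the environment variable can be set in the shell before launching a run. The directory name below is a hypothetical choice, and creating the directory up front is a precaution that assumes QUDA does not create it for you.

```shell
# Hypothetical cache location; any writable directory should do.
QUDA_RESOURCE_PATH="${HOME:-/tmp}/quda_resources"
export QUDA_RESOURCE_PATH

# Create the directory ahead of time (assumption: QUDA expects it to exist).
mkdir -p "$QUDA_RESOURCE_PATH"

# Launch a QUDA test or application from this shell, for example:
#   ./dslash_test
# After the run, tunecache.tsv should be found inside $QUDA_RESOURCE_PATH.
```

Keeping the same directory across jobs on the same GPU type lets later runs reuse the cached tuning results.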

Multi-GPU emulation

To aid performance modelling and debugging, it is possible to switch on communication in a given dimension, even if that dimension is actually local to a given GPU. The command line flag --partition N enables this feature, where N is a 4-bit number whose bits 0, 1, 2, 3 switch communication on or off in dimensions x, y, z, t, respectively. For example:

dslash_test --partition 1     ## enable x dimension communication
dslash_test --partition 6     ## enable y and z dimension communication
dslash_test --partition 15    ## enable full communication
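The bit-to-dimension mapping can be checked with a small helper. The function below, decode_partition, is a hypothetical illustration and not part of QUDA; it only decodes the value of N as described above (bit 0 for x, bit 1 for y, bit 2 for z, bit 3 for t).

```shell
# decode_partition N: print the dimensions whose bit is set in N.
# Hypothetical helper for illustration only; not shipped with QUDA.
decode_partition() {
  n=$1
  out=""
  i=0
  for d in x y z t; do
    # Test bit i of n: bit 0 -> x, bit 1 -> y, bit 2 -> z, bit 3 -> t.
    if [ $(( (n >> i) & 1 )) -eq 1 ]; then
      out="$out $d"
    fi
    i=$((i + 1))
  done
  # Strip the leading space before printing.
  echo "${out# }"
}

decode_partition 1    # prints: x
decode_partition 6    # prints: y z
decode_partition 15   # prints: x y z t
```

Going the other way, N is the sum of 1, 2, 4 and 8 for each of x, y, z and t that should communicate, so enabling only the t dimension corresponds to N = 8.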

Debugging

QUDA has two specific debugging modes: HOST_DEBUG and DEVICE_DEBUG.

HOST_DEBUG compiles all host code using the -g flag and ensures that all CUDA error reporting is done synchronously (i.e., the GPU and CPU are synchronized prior to fetching the error state). For most debugging, HOST_DEBUG is all that should be needed, since most bugs tend to be in CPU code. Enabling HOST_DEBUG has a noticeable performance impact, at the 20-50% level, with the penalty being greater at smaller local volumes.
DEVICE_DEBUG compiles all GPU kernels using the -G flag. This provides accurate line reporting in cuda-gdb and cuda-memcheck. Enabling this incurs a huge performance penalty, at the 100x level.
