-
Notifications
You must be signed in to change notification settings - Fork 101
Running the NERSC MILC Benchmarks
For details on compiling both QUDA and MILC see MILC with QUDA. The NERSC MILC benchmarks can be obtained here, for example via
> wget https://portal.nersc.gov/project/m888/apex/MILC_160413.tgz
[...]
> tar xvzf MILC_160413.tgz
[...]
This tarball includes only an outdated version of the MILC source code and run scripts. As described in MILC with QUDA, one should use the latest develop
branch of MILC. Untarring the NERSC MILC benchmark tar produces a directory MILC-apex
.
With these obtained, the input lattice files can be obtained from here. The relevant input lattice (.chklat file) for the desired benchmark should be placed directly in the MILC-apex/benchmarks/lattices
directory.
Benchmark | Volume | File | Link |
---|---|---|---|
Small | 18x18x18x36 | 18x18x18x36.chklat | here |
Medium | 36x36x36x72 | 36x36x36x72.chklat | here |
Large | 72x72x72x144 | 72x72x72x144.chklat | here |
Grand Challenge | 144x144x144x288 | 144x144x144x288.chklat | here |
For general running tips when using QUDA, we refer to QUDA start guide and multi-GPU guide. Note that on first call to any given GPU kernel, QUDA will autotune the launch parameters for optimal performance, and save these parameters to disk for subsequent reuse. The default name for this cachefile is tunecache.tsv
, and it is saved in the directory specified in the environment variable QUDA_RESOURCE_PATH
as noted in the QUDA start guide.
Autotuning does add some overhead that should not be counted in benchmarking. We recommend therefore doing an initial short tuning run before then rerunning with the original setting for a benchmarking run. This can be accomplished by, for example, modifying the line steps_per_trajectory 30
in MILC-apex/benchmarks/milc_in.sh
to steps_per_trajectory 2
for the tuning run only. Please note that there are two steps_per_trajectory
lines, the first one setting it to 20. That line is not relevant for the benchmark and can be ignored.
The expected performance from running the MILC benchmarks will depend greatly on the node topology of the system being run. Optimal performance will be obtained on systems with first-class GPU Direct RDMA performance, and to a lesser degree, systems that have fully-connected peer-to-peer node topologies. Update this example: For example, Piz Daint can complete the large problem benchmark in around 6300 (update this old number) seconds on 96 nodes. There the limiting factor is the lack of GPU Direct RDMA performance in the 100 KiB-2 MiB message size regime which is critical for scalable performance. This compares to DGX-1 Pascal, which can complete the benchmark in around the same time on half the total number of GPUs owing to improved GPU Direct RDMA performance.
The benchmarks are located in the MILC-apex/benchmarks
directory. As noted above, included with the tarball are the unmodified runscripts and an old version of the MILC source code. Since we will use a separate MILC installation, we need to update the soft links in MILC-apex/benchmarks/*/su3_rhmd_hisq
directories to point to the su3_rhmd_hisq
binary in the QUDA-enabled build of MILC (milc_qcd/ks_imp_rhmc
) directory.
Assuming the necessary input lattices have been fetched using the instructions we are now in a position to run the benchmarks. For quick testing, only the minimal changes listed below are needed. For running at scale (strong scaling, minimizing communication overheads in general), please refer to the subsequent instructions on performance tuning.
As an example, to run the small benchmark on a 3x GPU node using MPI, with a host 6-core process, one need only change the the following lines from the top of run_small.sh
:
total_nodes=1 # one node
cores_per_node=6 # six core processor
numa_per_node=1 # single socket so one numa (also applies to dual-socket with optimized bindings, see below)
threads_per_rank=2 # two threads per rank
run_type=mpirun # launch job using mpirun
In total the number of MPI ranks will be $total_nodes * $cores_per_node / $threads_per_rank
. Depending on your MPI distribution you may need to modify the mpirun
command on line 18 of milc_in.sh
. For example, the script uses -env
to pass environment variables to mpirun, whereas OpenMPI expects -x
. The I_MPI_PIN_DOMAIN socket
setting in the script is for MPICH-based MPI, so when running on OpenMPI, this command should not be included since otherwise this will cause a runtime error. As a note, we will add comments on core/socket bindings in the subsequent section.
Last, before running, you should set the below environment variables. You can either set them on your shell (modifying export
as appropriate if you aren't using bash
), or modify run_[small/medium/etc].sh
before the run_milc
function at the bottom of the script:
export LD_LIBRARY_PATH=${PATH_TO_QUDA_INSTALL}/lib # set path to QUDA lib, only necessary for v1.0.x tags
export QUDA_RESOURCE_PATH=. # set path to write out tunecache.tsv, can be anything
export QUDA_MILC_HISQ_RECONSTRUCT=13 # set QUDA-MILC solver optimization
export QUDA_MILC_HISQ_RECONSTRUCT_SLOPPY=9 # set QUDA-MILC solver optimization
export OMP_NUM_THREADS=2 # here two threads per process,
# only relevant for the small parts of the benchmark offloaded to the CPU
More information on these environment variables are available on the QUDA Environment Variables page. The small benchmark can now be run via:
./run_small.sh
For reference, as of the time of writing, this benchmark should take less than 300 seconds to complete on a 3x Quadro V100 system.
The NERSC Medium benchmark is "small" enough to run on a single NVIDIA GPU with >= 32 GB of onboard memory (Quadro V100, Tesla V100-32GB, NVIDIA A100-40GB and -80GB). However, due to the large local volume, autotuning can take upwards of 30 minutes to an hour, during which no output gets printed. Since this is a long time, it may appear that the benchmark is hanging. You can verify that it is still tuning by running nvidia-smi
on the same node as a proxy for querying GPU activity. In any case, please make sure you have given at least 1 hour to tune (either interactively, or via a wall-time of at least 1 hour) before reaching out about the benchmark appearing to hang.
The NERSC Medium benchmark can be run on single GPUs with less than 32 GB of onboard memory by disabling the device pool manager and enabling managed memory support. This is only recommended if absolutely necessary. Do not run this way "for fun". This can be accomplished by setting the following two environment variables:
export QUDA_ENABLE_DEVICE_MEMORY_POOL=0
export QUDA_ENABLE_MANAGED_MEMORY=1
More information on these environment variables is available on the QUDA Environment Variables page.
For running on systems with one GPU and one CPU per node, e.g., Cray systems, we can use the run scripts as provided with minor modifications. We set the geometry with an additional parameter to the binary, and parametrize this with new parameters px, py, pz, and pt which are now set in the appropriate run.sh scripts.
The listings below show the modified milc_in.sh (in MILC_APEX/benchmarks
directory) and run_large.sh (in MILC_APEX/benchmarks/large
) scripts required for running the large problem MILC benchmark on 96 nodes of Piz Daint using QUDA. The required environmental variables for running with QUDA are set in run_large.sh.
Listing 1: Modified milc_in.sh script that parametrizes the process topology. Additions are marked in green.
#!/bin/bash
#
# The following need to be defined before calling run_milc
# nx, ny, nz, nt - problem dimensions
# warms - # of warmups to perform
# trajecs - # of trajectories to run
# checklat - lattice checkpoint file name
# save_cmd - set to "forget" or "save_serial $checklat"
# reload_cmd - set to "reload_parallel $checklat" or "continue"
#
# choose which command to execute by setting run_type before calling
function run_milc() {
if [ "$run_type" == "srun" ]; then
- command="srun -n $N --ntasks-per-socket=$S ./su3_rhmd_hisq"
+ command="srun -n $N --ntasks-per-socket=$S ./su3_rhmd_hisq -qmp-geom $px $py $pz $pt"
elif [ "$run_type" == "aprun" ]; then
- command="aprun -n $N -S $S -d $threads_per_rank -cc numa_node ./su3_rhmd_hisq"
+ command="aprun -n $N -S $S -d $threads_per_rank -cc numa_node ./su3_rhmd_hisq -qmp-geom $px $py $pz $pt"
else
- command="mpirun -n $N -env I_MPI_PIN_DOMAIN socket -env OMP_NUM_THREADS=$threads_per_rank ./su3_rhmd_hisq"
+ command="mpirun -n $N -env I_MPI_PIN_DOMAIN socket -env OMP_NUM_THREADS=$threads_per_rank ./su3_rhmd_hisq -qmp-geom $px $py $pz $pt"
fi
echo "$0: Running \"${command}\" "
if [ "$debug" == "true" ]; then
command="cat ";
echo "Debugging script with cat"
fi
$command <<EOI
prompt 0
nx $nx
ny $ny
nz $nz
nt $nt
iseed 4563421
n_pseudo 5
load_rhmc_params rationals_m001m05m5.test1
beta 6.3
n_dyn_masses 3
dyn_mass .001 .05 .5
dyn_flavors 2 1 1
u0 0.890
warms $warms
trajecs 0
traj_between_meas 40
microcanonical_time_step .05
steps_per_trajectory 20
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
prec_ff 2
number_of_pbp_masses 0
fresh
$save_cmd
warms 0
trajecs $trajecs
traj_between_meas 2
microcanonical_time_step .0333
steps_per_trajectory 30
cgresid_md_fa_gr 1e-5 1e-5 1e-5
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
prec_ff 2
number_of_pbp_masses 3
max_cg_prop 500
max_cg_prop_restarts 1
npbp_reps 1
prec_pbp 2
mass 0.01
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
mass 0.05
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
mass 0.5
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
$reload_cmd
forget
EOI
}
Listing 2: Modified run_large.sh script, suitable for 96 nodes of Piz Daint.
#!/bin/bash -l
#SBATCH --partition=regular
#SBATCH --nodes=96
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=12
#SBATCH --time=04:00:00
#SBATCH --job-name=milc-large
#SBATCH --partition=normal
#SBATCH --constraint=gpu
export QUDA_ENABLE_DEVICE_MEMORY_POOL=1 # to fit large problem on 32 nodes set this to 0
export QUDA_MILC_HISQ_RECONSTRUCT=13 # set QUDA-MILC solver optimization
export QUDA_MILC_HISQ_RECONSTRUCT_SLOPPY=9 # set QUDA-MILC solver optimization
export QUDA_RESOURCE_PATH=. # set path for storing tunecache file
export QUDA_ENABLE_GDR=1 # enable GPU Direct RDMA (no benefit on Cray)
export CUDA_DEVICE_MAX_CONNECTIONS=1
export MPICH_RDMA_ENABLED_CUDA=1 # enable GPU Direct RDMA (no benefit on Cray)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load daint-gpu
# load the run_milc() function
. milc_in.sh
#----- START USER DEFINED PARAMETERS -----#
#build_lattice=true # Set only if you want to create a lattice file
# Comment out otherwise
# You really only want to do this once
total_nodes=96 # match above allocation request!
cores_per_node=12
numa_per_node=1
threads_per_rank=12
run_type=srun # comment out to debug script
# Parse command line parameters
debug="false"
while [ "$#" -ge 1 ] ; do
case "$1" in
"--build" ) build_lattice=true; shift 1;;
"--debug" ) debug="true"; shift 1;;
"--cores" | "-c" ) cores_per_node=$2; shift 2;;
"--threads" | "-t" ) threads_per_rank=$2; shift 2;;
*) break;;
esac
done
#----- END USER DEFINED PARAMETERS -----#
# Set problem size
nx=72
ny=72
nz=72
nt=144
# set process topology
px=2
py=2
pz=3
pt=8
N=$(( $total_nodes*$cores_per_node/$threads_per_rank )) # total ranks
S=$(( $N/$total_nodes/$numa_per_node )) # ranks per socket
# sanity check for parameters
if [ $(( $N*$threads_per_rank )) -gt $(( total_nodes*$cores_per_node )) ]; then
echo "$0: ERROR, number of threads exceeds total concurrency, aborting"
exit;
else
echo "$0: Using $N MPI rank(s), $S rank(s) per NUMA domain, $threads_per_rank thread(s) per rank"
fi
dimension=${nx}x${ny}x${nz}x${nt}
checklat=$dimension.chklat
if [ "$build_lattice" == "true" ]; then
warms=40
trajecs=2
save_cmd="save_serial $checklat"
reload_cmd="continue"
echo "$0: Building lattice $dimension on $total_nodes nodes"
else
warms=0
trajecs=2
save_cmd="forget"
reload_cmd="reload_parallel $checklat"
echo "$0: Running lattice $dimension on $total_nodes nodes"
fi
# Run MILC
export OMP_NUM_THREADS=$threads_per_rank
echo -n "$0: Begin time: "; date
run_milc
echo -n "$0: End time: "; date
See also the discussion here for instructions on how to use OpenMPI/UCX and/or MVAPICH MPI.
On systems with multiple GPUs and / or multiple NICs, it is important to bind each process to the correct CPU core, and to ensure that a given GPU uses the correct NIC to minimize on-node memory traffic and maximize inter-node bandwidth. Peer-to-peer communication and GPU Direct RDMA performance are critical for the scalability of MILC.
In order to achieve correct NUMA placement we have found numactl to be most robust method on systems with multiple GPUs and CPUs. To this end, we found it easier to use a separate input parameter file as opposed to relying on the milc_in.sh script (which generates the implicit input file). We thus use the run.sh script (Listing 4) which sets the process affinity, assigns processes to NICs as well as setting any required environment variables for QUDA.
As an example, we now consider running the medium benchmark using 8 GPUs. With a global problem size 36x36x36x72, we decompose the problem onto 8 processes. In general, we seek a local volume that minimizes surface area (e.g., symmetric decompositions), while also minimizing the splitting of the fastest running X dimension. E.g., 1x1x2x4 is superior to 2x2x1x2, although they have equal volumes, the former does not split the X dimension and so there are no strided memory access patterns when updating the halos. We set the node topology with the node_geometry option in the sixth line of the input. With a topology 1x1x2x4 this gives each GPU a problem of size 36x36x18x18.
Another optimization point to be aware of is the node topology: QUDA (and therefore MILC) take advantage of peer-to-peer connections between GPUs in a node. On systems that are not fully connected, e.g., GPUs on either side of QPI, there can be a large performance impact versus having the GPUs directly connected (via NVLink or PCIe). At startup QUDA will list which process pairs have peer-to-peer connectively allowing visual confirmation of the node topology.
Similarly, for optimal GPU Direct RDMA performance, we need to ensure that the MPI process assigned to each GPU uses the closest NIC to that GPU. This can be achieved through setting the “OMPI_MCA_btl_openib_if_include” to the appropriate NIC using OpenMPI. In Table 3 below we include the appropriate OpenMPI flags for enabling GPU Direct RDMA.
The following is an example script that was used to launch an 8 GPU job running on a DGX-1 system using numactl to achieve the correct binding. This system has 8 GPUs in a node, with pairs of GPUs connected to a common NIC.
Listing 3 job launch script (e.g., for slurm)
#!/bin/bash
module purge
module load gcc/6.2.0
module load cuda/9.0.103
module load openmpi/1.10.7
mpirun -np 8 -npernode 8 --bind-to none --mca btl sm,self,openib \
--mca btl_openib_want_cuda_gdr 1 --mca btl_openib_cuda_rdma_limit 4000000000 \
-x APP="./su3_rhmd_hisq input" ./run.sh
Table 4 "run.sh" script
#!/bin/bash
# Number of physical CPU cores per GPU
export OMP_NUM_THREADS=4
export QUDA_ENABLE_DEVICE_MEMORY_POOL=1 # to fit large problem on 32 nodes set this to 0
export CUDA_DEVICE_MAX_CONNECTIONS=1
# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.
# enable GDR support
export QUDA_ENABLE_GDR=1
# set QUDA-MILC solver optimization
export QUDA_MILC_HISQ_RECONSTRUCT=13 # set QUDA-MILC solver optimization
export QUDA_MILC_HISQ_RECONSTRUCT_SLOPPY=9 # set QUDA-MILC solver optimization
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
case ${lrank} in
[0])
export OMPI_MCA_btl_openib_if_include=mlx5_0
numactl --physcpubind=1-4 \
$APP
;;
[1])
export OMPI_MCA_btl_openib_if_include=mlx5_0
numactl --physcpubind=5-8 \
$APP
;;
[2])
export OMPI_MCA_btl_openib_if_include=mlx5_1
numactl --physcpubind=10-13 \
$APP
;;
[3])
export OMPI_MCA_btl_openib_if_include=mlx5_1
numactl --physcpubind=15-18 \
$APP
;;
[4])
export OMPI_MCA_btl_openib_if_include=mlx5_2
numactl --physcpubind=21-24 \
$APP
;;
[5])
export OMPI_MCA_btl_openib_if_include=mlx5_2
numactl --physcpubind=25-28 \
$APP
;;
[6])
export OMPI_MCA_btl_openib_if_include=mlx5_3
numactl --physcpubind=30-33 \
$APP
;;
[7])
export OMPI_MCA_btl_openib_if_include=mlx5_3
numactl --physcpubind=35-38 \
$APP
;;
esac
Listing 4: Medium input file. A minimum of 2x (16 GiB) GPUs are required.
prompt 0
nx 36
ny 36
nz 36
nt 72
node_geometry 1 1 2 4
iseed 4563421
n_pseudo 5
load_rhmc_params rationals_m001m05m5.test1
beta 6.3
n_dyn_masses 3
dyn_mass .001 .05 .5
dyn_flavors 2 1 1
u0 0.890
warms 0
trajecs 0
traj_between_meas 40
microcanonical_time_step .05
steps_per_trajectory 20
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
prec_ff 2
number_of_pbp_masses 0
fresh
forget
warms 0
trajecs 2
traj_between_meas 2
microcanonical_time_step .0333
steps_per_trajectory 30
cgresid_md_fa_gr 1e-5 1e-5 1e-5
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
prec_ff 2
number_of_pbp_masses 3
max_cg_prop 500
max_cg_prop_restarts 1
npbp_reps 1
prec_pbp 2
mass 0.01
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
mass 0.05
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
mass 0.5
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
reload_parallel 36x36x36x72.chklat
forget
Listing 5: Large input file: the only difference from the medium input file above are the problem sizes (nx,ny,nz,nt) and the name of the checkpoint file to use (reload_parallel). A minimum of 32x (16 GiB) GPUs are required.
prompt 0
nx 72
ny 72
nz 72
nt 144
node_geometry 1 2 2 8
iseed 4563421
n_pseudo 5
load_rhmc_params rationals_m001m05m5.test1
beta 6.3
n_dyn_masses 3
dyn_mass .001 .05 .5
dyn_flavors 2 1 1
u0 0.890
warms 0
trajecs 0
traj_between_meas 40
microcanonical_time_step .05
steps_per_trajectory 20
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
prec_ff 2
number_of_pbp_masses 0
fresh
forget
warms 0
trajecs 2
traj_between_meas 2
microcanonical_time_step .0333
steps_per_trajectory 30
cgresid_md_fa_gr 1e-5 1e-5 1e-5
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
prec_ff 2
number_of_pbp_masses 3
max_cg_prop 500
max_cg_prop_restarts 1
npbp_reps 1
prec_pbp 2
mass 0.01
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
mass 0.05
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
mass 0.5
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
reload_parallel 72x72x72x144.chklat
forget