Running the NERSC MILC Benchmarks
For details on compiling both QUDA and MILC see here. The NERSC MILC benchmarks can be obtained here; the tarball includes only the MILC source code and run scripts. The input lattice files can be obtained separately from here. The relevant input lattice (.chklat file) for the desired benchmark should be placed in the MILC-apex/benchmarks/lattices directory (a brief download sketch follows the table below).
Benchmark | Volume | File | Link |
---|---|---|---|
Small | 18x18x18x36 | 18x18x18x36.chklat | here |
Medium | 36x36x36x72 | 36x36x36x72.chklat | here |
Large | 72x72x72x144 | 72x72x72x144.chklat | here |
Grand Challenge | 144x144x144x288 | 144x144x144x288.chklat | here |
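As a minimal download sketch, the commands below fetch the medium lattice and place it where the run scripts expect it; the URL shown is only a placeholder, so substitute the actual link from the table above for the benchmark you intend to run.
# Hypothetical example: fetch the medium lattice into the lattices directory.
# Replace the placeholder URL with the real link from the table above.
cd MILC-apex/benchmarks/lattices
wget https://<lattice-host>/36x36x36x72.chklat   # placeholder URL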
For general running tips when using QUDA, we refer to the QUDA start guide and multi-GPU guide. Note that on the first call to any given GPU kernel, QUDA will autotune the launch parameters for optimal performance and save these parameters to disk for subsequent reuse. This adds some overhead that should not be counted when benchmarking. We therefore recommend doing an initial short tuning run (e.g., set the second steps_per_trajectory in the input to 2) before rerunning with the original setting for the benchmarking run.
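To make this tune-then-benchmark pattern concrete, a minimal sketch follows. Here input_tune and input are hypothetical file names (copies of the benchmark input with the second steps_per_trajectory set to 2 and to its original value, respectively), and the single-rank mpirun invocation is purely illustrative; the run scripts described later wrap the same idea.
# Sketch of a tuning pass followed by a benchmark pass.
# QUDA writes its autotuned launch parameters to tunecache.tsv in
# QUDA_RESOURCE_PATH and reuses them on later runs, so time only the second run.
export QUDA_RESOURCE_PATH=.              # directory must exist and be writable

mpirun -n 1 ./su3_rhmd_hisq input_tune   # short run: populates ./tunecache.tsv
mpirun -n 1 ./su3_rhmd_hisq input        # benchmark run: reuses the tunecache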
The expected performance of the MILC benchmarks will depend greatly on the node topology of the system being run. Optimal performance will be obtained on systems with first-class GPU Direct RDMA performance and, to a lesser degree, on systems with fully-connected peer-to-peer node topologies. For example, Piz Daint can complete the large problem benchmark in around 6300 seconds (an older measurement that may need updating) on 96 nodes. There the limiting factor is the lack of GPU Direct RDMA performance in the 100 KiB-2 MiB message-size regime, which is critical for scalable performance. This compares to DGX-1 (Pascal), which can complete the benchmark in around the same time on half the total number of GPUs, owing to improved GPU Direct RDMA performance.
The benchmarks are located in the MILC-apex/benchmarks directory. Included with the tarball are the unmodified run scripts and an old version of the MILC source code. Since we will use a separate MILC installation, we need to update the su3_rhmd_hisq soft links in the MILC-apex/benchmarks/* directories to point to the su3_rhmd_hisq binary in the QUDA-enabled build of MILC (the milc_qcd/ks_imp_rhmc directory).
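As a minimal sketch, assuming milc_qcd was built as a sibling directory of MILC-apex (adjust the relative path otherwise), the links can be repointed as follows; the benchmark directory names in the loop are illustrative.
# Repoint the su3_rhmd_hisq soft links at the QUDA-enabled MILC binary.
# The relative path assumes milc_qcd sits alongside MILC-apex.
cd MILC-apex/benchmarks
for b in small medium large; do
    ln -sf ../../../milc_qcd/ks_imp_rhmc/su3_rhmd_hisq "$b/su3_rhmd_hisq"
done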
Assuming the necessary input lattices have been fetched as described above, we are now in a position to run the benchmarks. For quick testing, only minimal changes to the run scripts should be needed; the further changes described later on this page are intended for performance tuning, especially with respect to strong scaling and minimizing communication overhead.
For example, to run the small benchmark on a node with 3 GPUs using MPI and a 6-core host processor, one need only change the following lines at the top of run_small.sh:
total_nodes=1 # one node
cores_per_node=6 # six core processor
numa_per_node=1 # single socket so one numa
threads_per_rank=2 # two threads per rank
run_type=mpirun # launch job using mpirun
Then simply running the following should work:
export LD_LIBRARY_PATH=${PATH_TO_QUDA_INSTALL}/lib # set path to QUDA lib
export QUDA_RESOURCE_PATH=. # set path to write out tunecache.tsv
export QUDA_MILC_HISQ_RECONSTRUCT=13 # set QUDA-MILC solver optimization
export QUDA_MILC_HISQ_RECONSTRUCT_SLOPPY=9 # set QUDA-MILC solver optimization
export OMP_NUM_THREADS=2 # two threads per process
./run_small.sh
For reference, as of the time of writing, this benchmark should take less than 300 seconds to complete on a 3x Quadro V100 system.
For running on systems with one GPU and one CPU per node, e.g., Cray systems, we can use the run scripts as provided, with minor modifications. We set the process geometry with an additional parameter passed to the binary, parametrized by new variables px, py, pz, and pt which are set in the appropriate run.sh scripts.
The listings below show the modified milc_in.sh (in the MILC-apex/benchmarks directory) and run_large.sh (in MILC-apex/benchmarks/large) scripts required for running the large-problem MILC benchmark on 96 nodes of Piz Daint using QUDA. The required environment variables for running with QUDA are set in run_large.sh.
Listing 1: Modified milc_in.sh script that parametrizes the process topology. Added lines are prefixed with + and the lines they replace with -.
#!/bin/bash
#
# The following need to be defined before calling run_milc
# nx, ny, nz, nt - problem dimensions
# warms - # of warmups to perform
# trajecs - # of trajectories to run
# checklat - lattice checkpoint file name
# save_cmd - set to "forget" or "save_serial $checklat"
# reload_cmd - set to "reload_parallel $checklat" or "continue"
#
# choose which command to execute by setting run_type before calling
function run_milc() {
if [ "$run_type" == "srun" ]; then
- command="srun -n $N --ntasks-per-socket=$S ./su3_rhmd_hisq"
+ command="srun -n $N --ntasks-per-socket=$S ./su3_rhmd_hisq -qmp-geom $px $py $pz $pt"
elif [ "$run_type" == "aprun" ]; then
- command="aprun -n $N -S $S -d $threads_per_rank -cc numa_node ./su3_rhmd_hisq"
+ command="aprun -n $N -S $S -d $threads_per_rank -cc numa_node ./su3_rhmd_hisq -qmp-geom $px $py $pz $pt"
else
- command="mpirun -n $N -env I_MPI_PIN_DOMAIN socket -env OMP_NUM_THREADS=$threads_per_rank ./su3_rhmd_hisq"
+ command="mpirun -n $N -env I_MPI_PIN_DOMAIN socket -env OMP_NUM_THREADS=$threads_per_rank ./su3_rhmd_hisq -qmp-geom $px $py $pz $pt"
fi
echo "$0: Running \"${command}\" "
if [ "$debug" == "true" ]; then
command="cat ";
echo "Debugging script with cat"
fi
$command <<EOI
prompt 0
nx $nx
ny $ny
nz $nz
nt $nt
iseed 4563421
n_pseudo 5
load_rhmc_params rationals_m001m05m5.test1
beta 6.3
n_dyn_masses 3
dyn_mass .001 .05 .5
dyn_flavors 2 1 1
u0 0.890
warms $warms
trajecs 0
traj_between_meas 40
microcanonical_time_step .05
steps_per_trajectory 20
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
prec_ff 2
number_of_pbp_masses 0
fresh
$save_cmd
warms 0
trajecs $trajecs
traj_between_meas 2
microcanonical_time_step .0333
steps_per_trajectory 30
cgresid_md_fa_gr 1e-5 1e-5 1e-5
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
prec_ff 2
number_of_pbp_masses 3
max_cg_prop 500
max_cg_prop_restarts 1
npbp_reps 1
prec_pbp 2
mass 0.01
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
mass 0.05
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
mass 0.5
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
$reload_cmd
forget
EOI
}
Listing 2: Modified run_large.sh script, suitable for 96 nodes of Piz Daint.
#!/bin/bash -l
#SBATCH --partition=regular
#SBATCH --nodes=96
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=12
#SBATCH --time=04:00:00
#SBATCH --job-name=milc-large
#SBATCH --partition=normal
#SBATCH --constraint=gpu
export QUDA_ENABLE_DEVICE_MEMORY_POOL=1 # to fit large problem on 32 nodes set this to 0
export QUDA_MILC_HISQ_RECONSTRUCT=13 # set QUDA-MILC solver optimization
export QUDA_MILC_HISQ_RECONSTRUCT_SLOPPY=9 # set QUDA-MILC solver optimization
export QUDA_RESOURCE_PATH=. # set path for storing tunecache file
export QUDA_ENABLE_GDR=1 # enable GPU Direct RDMA (no benefit on Cray)
export CUDA_DEVICE_MAX_CONNECTIONS=1
export MPICH_RDMA_ENABLED_CUDA=1 # enable GPU Direct RDMA (no benefit on Cray)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load daint-gpu
# load the run_milc() function
. milc_in.sh
#----- START USER DEFINED PARAMETERS -----#
#build_lattice=true # Set only if you want to create a lattice file
# Comment out otherwise
# You really only want to do this once
total_nodes=96 # match above allocation request!
cores_per_node=12
numa_per_node=1
threads_per_rank=12
run_type=srun # comment out to debug script
# Parse command line parameters
debug="false"
while [ "$#" -ge 1 ] ; do
case "$1" in
"--build" ) build_lattice=true; shift 1;;
"--debug" ) debug="true"; shift 1;;
"--cores" | "-c" ) cores_per_node=$2; shift 2;;
"--threads" | "-t" ) threads_per_rank=$2; shift 2;;
*) break;;
esac
done
#----- END USER DEFINED PARAMETERS -----#
# Set problem size
nx=72
ny=72
nz=72
nt=144
# set process topology
px=2
py=2
pz=3
pt=8
N=$(( $total_nodes*$cores_per_node/$threads_per_rank )) # total ranks
S=$(( $N/$total_nodes/$numa_per_node )) # ranks per socket
# sanity check for parameters
if [ $(( $N*$threads_per_rank )) -gt $(( total_nodes*$cores_per_node )) ]; then
echo "$0: ERROR, number of threads exceeds total concurrency, aborting"
exit;
else
echo "$0: Using $N MPI rank(s), $S rank(s) per NUMA domain, $threads_per_rank thread(s) per rank"
fi
dimension=${nx}x${ny}x${nz}x${nt}
checklat=$dimension.chklat
if [ "$build_lattice" == "true" ]; then
warms=40
trajecs=2
save_cmd="save_serial $checklat"
reload_cmd="continue"
echo "$0: Building lattice $dimension on $total_nodes nodes"
else
warms=0
trajecs=2
save_cmd="forget"
reload_cmd="reload_parallel $checklat"
echo "$0: Running lattice $dimension on $total_nodes nodes"
fi
# Run MILC
export OMP_NUM_THREADS=$threads_per_rank
echo -n "$0: Begin time: "; date
run_milc
echo -n "$0: End time: "; date
On systems with multiple GPUs and / or multiple NICs, it is important to bind each process to the correct CPU core, and to ensure that a given GPU uses the correct NIC to minimize on-node memory traffic and maximize inter-node bandwidth. Peer-to-peer communication and GPU Direct RDMA performance are critical for the scalability of MILC.
In order to achieve correct NUMA placement, we have found numactl to be the most robust method on systems with multiple GPUs and CPUs. To this end, we found it easier to use a separate input parameter file rather than relying on the milc_in.sh script (which generates the input file implicitly). We thus use the run.sh script (Listing 4), which sets the process affinity, assigns processes to NICs, and sets any required environment variables for QUDA.
As an example, we now consider running the medium benchmark using 8 GPUs. With a global problem size of 36x36x36x72, we decompose the problem onto 8 processes. In general, we seek a local volume that minimizes surface area (e.g., symmetric decompositions) while also minimizing the splitting of the fastest-running X dimension. For example, 1x1x2x4 is superior to 2x2x1x2: although both give the same local volume, the former does not split the X dimension, so there are no strided memory access patterns when updating the halos. We set the node topology with the node_geometry option in the sixth line of the input. With a topology of 1x1x2x4, this gives each GPU a local problem of size 36x36x18x18.
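As a rough back-of-the-envelope check (plain bash arithmetic, nothing QUDA-specific), the sketch below prints the local volume and the number of boundary sites that must be communicated for a couple of candidate geometries, confirming that 1x1x2x4 exchanges fewer face sites than 2x2x1x2 for this problem size while also leaving X unsplit.
#!/bin/bash
# Compare candidate process grids for a 36x36x36x72 lattice: print the local
# volume and the total number of sites on faces that must be communicated
# (only dimensions that are actually split contribute).
gx=36; gy=36; gz=36; gt=72
for geom in "1 1 2 4" "2 2 1 2"; do
  read -r px py pz pt <<< "$geom"
  lx=$((gx/px)); ly=$((gy/py)); lz=$((gz/pz)); lt=$((gt/pt))
  vol=$((lx*ly*lz*lt))
  surf=0
  [ $px -gt 1 ] && surf=$((surf + ly*lz*lt))
  [ $py -gt 1 ] && surf=$((surf + lx*lz*lt))
  [ $pz -gt 1 ] && surf=$((surf + lx*ly*lt))
  [ $pt -gt 1 ] && surf=$((surf + lx*ly*lz))
  echo "geometry ${px}x${py}x${pz}x${pt}: local volume $vol, face sites $surf"
done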
Another optimization point to be aware of is the node topology: QUDA (and therefore MILC) takes advantage of peer-to-peer connections between GPUs in a node. On systems that are not fully connected, e.g., with GPUs on either side of QPI, there can be a large performance impact versus having the GPUs directly connected (via NVLink or PCIe). At startup QUDA will list which process pairs have peer-to-peer connectivity, allowing visual confirmation of the node topology.
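If in doubt, the intra-node topology can also be inspected directly with nvidia-smi before launching (this is a generic CUDA utility, independent of MILC or QUDA):
# Print the GPU/NIC interconnect matrix: shows which GPU pairs are linked
# via NVLink or PCIe and which paths traverse QPI/the CPU.
nvidia-smi topo -m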
Similarly, for optimal GPU Direct RDMA performance, we need to ensure that the MPI process assigned to each GPU uses the NIC closest to that GPU. With OpenMPI this can be achieved by setting OMPI_MCA_btl_openib_if_include to the appropriate NIC. Listing 3 below includes the appropriate OpenMPI flags for enabling GPU Direct RDMA.
The following is an example script used to launch an 8-GPU job on a DGX-1 system, using numactl to achieve the correct binding. This system has 8 GPUs per node, with pairs of GPUs connected to a common NIC.
Listing 3: Job launch script (e.g., for Slurm)
#!/bin/bash
module purge
module load gcc/6.2.0
module load cuda/9.0.103
module load openmpi/1.10.7
mpirun -np 8 -npernode 8 --bind-to none --mca btl sm,self,openib \
--mca btl_openib_want_cuda_gdr 1 --mca btl_openib_cuda_rdma_limit 4000000000 \
-x APP="./su3_rhmd_hisq input" ./run.sh
Listing 4: run.sh script
#!/bin/bash
# Number of physical CPU cores per GPU
export OMP_NUM_THREADS=4
export QUDA_ENABLE_DEVICE_MEMORY_POOL=1 # to fit large problem on 32 nodes set this to 0
export CUDA_DEVICE_MAX_CONNECTIONS=1
# set the QUDA tunecache path
export QUDA_RESOURCE_PATH=.
# enable GDR support
export QUDA_ENABLE_GDR=1
# set QUDA-MILC solver optimization
export QUDA_MILC_HISQ_RECONSTRUCT=13 # set QUDA-MILC solver optimization
export QUDA_MILC_HISQ_RECONSTRUCT_SLOPPY=9 # set QUDA-MILC solver optimization
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
case ${lrank} in
[0])
export OMPI_MCA_btl_openib_if_include=mlx5_0
numactl --physcpubind=1-4 \
$APP
;;
[1])
export OMPI_MCA_btl_openib_if_include=mlx5_0
numactl --physcpubind=5-8 \
$APP
;;
[2])
export OMPI_MCA_btl_openib_if_include=mlx5_1
numactl --physcpubind=10-13 \
$APP
;;
[3])
export OMPI_MCA_btl_openib_if_include=mlx5_1
numactl --physcpubind=15-18 \
$APP
;;
[4])
export OMPI_MCA_btl_openib_if_include=mlx5_2
numactl --physcpubind=21-24 \
$APP
;;
[5])
export OMPI_MCA_btl_openib_if_include=mlx5_2
numactl --physcpubind=25-28 \
$APP
;;
[6])
export OMPI_MCA_btl_openib_if_include=mlx5_3
numactl --physcpubind=30-33 \
$APP
;;
[7])
export OMPI_MCA_btl_openib_if_include=mlx5_3
numactl --physcpubind=35-38 \
$APP
;;
esac
Listing 5: Medium input file. A minimum of two GPUs (16 GiB each) is required.
prompt 0
nx 36
ny 36
nz 36
nt 72
node_geometry 1 1 2 4
iseed 4563421
n_pseudo 5
load_rhmc_params rationals_m001m05m5.test1
beta 6.3
n_dyn_masses 3
dyn_mass .001 .05 .5
dyn_flavors 2 1 1
u0 0.890
warms 0
trajecs 0
traj_between_meas 40
microcanonical_time_step .05
steps_per_trajectory 20
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
prec_ff 2
number_of_pbp_masses 0
fresh
forget
warms 0
trajecs 2
traj_between_meas 2
microcanonical_time_step .0333
steps_per_trajectory 30
cgresid_md_fa_gr 1e-5 1e-5 1e-5
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
prec_ff 2
number_of_pbp_masses 3
max_cg_prop 500
max_cg_prop_restarts 1
npbp_reps 1
prec_pbp 2
mass 0.01
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
mass 0.05
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
mass 0.5
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
reload_parallel 36x36x36x72.chklat
forget
Listing 6: Large input file. The only differences from the medium input file above are the problem sizes (nx, ny, nz, nt), the node_geometry, and the name of the checkpoint file to use (reload_parallel). A minimum of 32 GPUs (16 GiB each) is required.
prompt 0
nx 72
ny 72
nz 72
nt 144
node_geometry 1 2 2 8
iseed 4563421
n_pseudo 5
load_rhmc_params rationals_m001m05m5.test1
beta 6.3
n_dyn_masses 3
dyn_mass .001 .05 .5
dyn_flavors 2 1 1
u0 0.890
warms 0
trajecs 0
traj_between_meas 40
microcanonical_time_step .05
steps_per_trajectory 20
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1 1 1
max_multicg_md_fa_gr 5 5 5
cgprec_md_fa_gr 2 2 2
prec_ff 2
number_of_pbp_masses 0
fresh
forget
warms 0
trajecs 2
traj_between_meas 2
microcanonical_time_step .0333
steps_per_trajectory 30
cgresid_md_fa_gr 1e-5 1e-5 1e-5
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
cgresid_md_fa_gr 1e-7 1e-7 1e-7
max_multicg_md_fa_gr 2500 2500 2500
cgprec_md_fa_gr 2 2 2
prec_ff 2
number_of_pbp_masses 3
max_cg_prop 500
max_cg_prop_restarts 1
npbp_reps 1
prec_pbp 2
mass 0.01
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
mass 0.05
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
mass 0.5
naik_term_epsilon 0
error_for_propagator 1e-8
rel_error_for_propagator 0
reload_parallel 72x72x72x144.chklat
forget