diff --git a/doc/sphinx/03_vibe/vibe.rst b/doc/sphinx/03_vibe/vibe.rst
index 5fbfb6d0..314df660 100644
--- a/doc/sphinx/03_vibe/vibe.rst
+++ b/doc/sphinx/03_vibe/vibe.rst
@@ -124,8 +124,33 @@ Results from Parthenon are provided on the following systems:
 * Commodity Technology System 1 (CTS-1) (Snow) with Intel Broadwell processors,
 * An Nvidia A100 GPU hosted on an [Nvidia Arm HPC Developer Kit](https://developer.nvidia.com/arm-hpc-devkit)
 
-CTS-1
---------
+ATS-3 Rocinante HBM
+-------------------
+
+.. csv-table:: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes, 40% Memory
+   :file: parthenon-ats5_spr-hbm128-intel-classic.csv
+   :align: center
+   :widths: 10, 10
+   :header-rows: 1
+
+.. figure:: ats3_40.png
+   :align: center
+   :scale: 50%
+   :alt: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes
+
+.. csv-table:: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes, 60% Memory
+   :file: parthenon-ats5_spr-hbm160-intel-classic.csv
+   :align: center
+   :widths: 10, 10
+   :header-rows: 1
+
+.. figure:: ats3_60.png
+   :align: center
+   :scale: 50%
+   :alt: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes
+
+CTS-1 Snow
+----------
 
 The mesh and meshblock size parameters are chosen to balance
 realism/performance with memory footprint. For the following tests we
@@ -196,31 +221,6 @@ Throughput performance of Parthenon-VIBE on a 40GB A100 is provided within the f
    :scale: 50%
    :alt: VIBE Throughput Performance on A100
 
-ATS-3
-------
-
-.. csv-table:: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes 40% Memory
-   :file: parthenon-ats5_spr-hbm128-intel-classic.csv
-   :align: center
-   :widths: 10, 10
-   :header-rows: 1
-
-.. figure:: ats3_40.png
-   :align: center
-   :scale: 50%
-   :alt: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes
-
-.. csv-table:: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes 60% Memory
-   :file: parthenon-ats5_spr-hbm160-intel-classic.csv
-   :align: center
-   :widths: 10, 10
-   :header-rows: 1
-
-.. figure:: ats3_60.png
-   :align: center
-   :scale: 50%
-   :alt: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes
-
 Verification of Results
 =======================
diff --git a/doc/sphinx/10_microbenchmarks/M1_STREAM/STREAM.rst b/doc/sphinx/10_microbenchmarks/M1_STREAM/STREAM.rst
index bccba2e7..d3dab238 100644
--- a/doc/sphinx/10_microbenchmarks/M1_STREAM/STREAM.rst
+++ b/doc/sphinx/10_microbenchmarks/M1_STREAM/STREAM.rst
@@ -42,42 +42,67 @@ The primary FOM is the Triad rate (MB/s).
 
 Building
 ========
 
-Adjustments to GOMP_CPU_AFFINITY may also be necessary.
+Adjustments to GOMP_CPU_AFFINITY may be necessary.
 
-You can modify the STREAM_ARRAY_SIZE value in the compilation step to change the array size used by the benchmark. Adjusting the array size can help accommodate the available memory on your system.
+The STREAM_ARRAY_SIZE value is a critical parameter, set at compile time, that controls the size of the arrays used to measure bandwidth. STREAM requires different amounts of memory on different systems, depending on both the system cache size(s) and the granularity of the system timer.
+
+Adjust the value of STREAM_ARRAY_SIZE to meet BOTH of the following criteria:
+
+1) Each array must be at least 4 times the size of the available cache memory. The difference between 10^6 and 2^20 is negligible here, so in practice the minimum array size is about 3.8 times the cache size.
+
+   (a) Example 1: One Xeon E3 with an 8 MB L3 cache: STREAM_ARRAY_SIZE should be >= 4 million, giving an array size of 30.5 MB and a total memory requirement of 91.5 MB.
+   (b) Example 2: Two Xeon E5s with a 20 MB L3 cache each (using OpenMP): STREAM_ARRAY_SIZE should be >= 20 million, giving an array size of 153 MB and a total memory requirement of 458 MB.
+
+2) The size should be large enough that the 'timing calibration' output by the program is at least 20 clock ticks.
+   For example, most versions of Windows have a 10 millisecond timer granularity.
+   20 "ticks" at 10 ms/tick is 200 milliseconds. If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec. This means each array must be at least 1 GB, or 128M elements.
+
+Set STREAM_ARRAY_SIZE using the -D flag on your compile line.
+
+Example calculations for the results presented here::
+
+   ARRAY_SIZE ~= 4 x (45 MiB cache / processor) x (2 processors) / (3 arrays) / (8 bytes / element) = 15 Mi elements = 15000000
+
+   HASWELL: Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
+   CACHE: 40M
+   SOCKETS: 2
+   4 * ( 40M * 2 ) / 3 ARRAYS / 8 BYTES/ELEMENT = 13.4 Mi elements = 13400000
+
+   BROADWELL: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
+   CACHE: 45M
+   SOCKETS: 2
+   4 * ( 45M * 2 ) / 3 ARRAYS / 8 BYTES/ELEMENT = 15.0 Mi elements = 15000000
+
+   SAPPHIRE RAPIDS: Intel(R) Xeon(R) Platinum 8480+
+   CACHE: 105M
+   SOCKETS: 2
+   4 * ( 105M * 2 ) / 3 ARRAYS / 8 BYTES/ELEMENT = 35 Mi elements = 35000000
 
 Running
 =======
 
 .. code-block:: bash
 
-   mpirun -np <nprocs> ./stream
+   srun -n <nprocs> ./stream
 
 Replace ``<nprocs>`` with the number of MPI processes you want to use. For example, to use 4 MPI processes, the command is:
 
 .. code-block:: bash
 
-   mpirun -np 4 ./stream
-
-Input
------
-
-Dependent Variable(s)
----------------------
-
-1. Maximum bandwidth while utilizing all hardware cores and threads. MAX_BW
-2. A minimum number of cores and threads that achieves MAX_BW. MIN_CT
+   srun -n 4 ./stream
 
 Example Results
 ===============
 
+ATS-3 Rocinante HBM
+-------------------
+
 CTS-1 Snow
 -----------
 
 .. csv-table:: STREAM microbenchmark bandwidth measurement
-   :file: stream-cts1_ats5intel-oneapi-openmpi.csv
+   :file: stream_cts1.csv
    :align: center
-   :widths: 10, 10
+   :widths: 10, 10, 10
    :header-rows: 1
 
 .. figure:: cpu_cts1.png
@@ -85,6 +110,3 @@ CTS-1 Snow
    :scale: 50%
    :alt: STREAM microbenchmark bandwidth measurement
 
-ATS-3 Rocinante HBM
--------------------
-
diff --git a/doc/sphinx/10_microbenchmarks/M1_STREAM/cpu.gp b/doc/sphinx/10_microbenchmarks/M1_STREAM/cpu.gp
index 36f35a67..6af002ff 100644
--- a/doc/sphinx/10_microbenchmarks/M1_STREAM/cpu.gp
+++ b/doc/sphinx/10_microbenchmarks/M1_STREAM/cpu.gp
@@ -4,13 +4,16 @@ set terminal pngcairo enhanced size 1024, 768 dashed font 'Helvetica,18'
 set output "cpu_cts1.png"
 
 set title "STREAM Single node bandwidth" font "serif,22"
 set xlabel "No. Processing Elements"
-set ylabel "Figure of Merit Triad (MB/s)"
+set ylabel "Per-core Triad (MB/s)"
+set y2label "FOM: Total Triad (MB/s)"
+set y2tics
 
-set xrange [1:64]
+set xrange [1:40]
+set yrange [3000:15000]
 
-set logscale x 2
-set logscale y 2
+# set logscale x 2
+# set logscale y 2
 
 set grid
 show grid
@@ -21,9 +24,6 @@ set key autotitle columnheader
 set style line 1 linetype 6 dashtype 1 linecolor rgb "#FF0000" linewidth 2 pointtype 6 pointsize 3
 set style line 2 linetype 1 dashtype 2 linecolor rgb "#FF0000" linewidth 2
 
-plot "stream-cts1_ats5intel-oneapi-openmpi.csv" using 1:2 with linespoints linestyle 1
+plot "stream_cts1.csv" using 1:2 with linespoints linestyle 1 axis x1y1, "" using 1:3 with line linestyle 2 axis x1y2
 
-# set output "cpu_133M.png"
-# set title "Branson Strong Scaling Performance on CTS-1, 133M particles" font "serif,22"
-# plot "cpu_133M.csv" using 1:2 with linespoints linestyle 1, "" using 1:3 with line linestyle 2
diff --git a/doc/sphinx/10_microbenchmarks/M1_STREAM/stream_cts1.csv b/doc/sphinx/10_microbenchmarks/M1_STREAM/stream_cts1.csv
new file mode 100644
index 00000000..dd40b0ba
--- /dev/null
+++ b/doc/sphinx/10_microbenchmarks/M1_STREAM/stream_cts1.csv
@@ -0,0 +1,8 @@
+No. Cores,Per-core Triad (MB/s),Total Triad (MB/s)
+1,10690.1,10690.1
+2,10701.3,21402.6
+4,9316.5,37266.0
+8,7884.5,63076.0
+16,7747.5,123960.0
+32,5510.3,176329.6
+36,3189.2,114811.2
\ No newline at end of file
diff --git a/doc/sphinx/10_microbenchmarks/M3_DGEMM/DGEMM.rst b/doc/sphinx/10_microbenchmarks/M3_DGEMM/DGEMM.rst
index 850b6e73..6acfecab 100644
--- a/doc/sphinx/10_microbenchmarks/M3_DGEMM/DGEMM.rst
+++ b/doc/sphinx/10_microbenchmarks/M3_DGEMM/DGEMM.rst
@@ -16,6 +16,8 @@ Problem
 -------
 
 .. math::
+
+   \mathbf{C} = \alpha\,\mathbf{A}\mathbf{B} + \beta\,\mathbf{C}
 
 Where :math:`A`, :math:`B`, and :math:`C` are square :math:`N \times N` matrices and :math:`\alpha` and :math:`\beta` are scalars. This operation is repeated :math:`R` times.
 
@@ -30,7 +31,6 @@ GFLOP/s rate: GF/s
 
 Run Rules
 ---------
-
 * Vendors are permitted to change the source code in the region marked in the source.
 * Optimized BLAS/DGEMM routines are permitted (and encouraged) to demonstrate the highest performance.
 * Vendors may modify the Makefile(s) as required
 
@@ -40,12 +40,14 @@ Building
 
 Makefiles are provided for the intel and gcc compilers.
 Before building, load the compiler and BLAS libraries into the PATH and LD_LIBRARY_PATH.
 
-.. code-block::
+.. code-block:: bash
 
    cd src
    patch -p1 < ../dgemm_omp_fixes.patch
    make
 
+..
+
 If using a different compiler, copy and modify the simple makefiles to apply the appropriate flags.
 If using a different BLAS library than MKL or OpenBLAS, modify the C source file to use the correct header and dgemm call.
 
@@ -58,12 +60,18 @@ Running
 DGEMM uses OpenMP but does not use MPI.
 Set the number of OpenMP threads before running.
 
 .. code-block:: bash
 
+   export OPENBLAS_NUM_THREADS=<nthreads>
    export OMP_NUM_THREADS=<nthreads>
 
+..
+
 .. code-block:: bash
 
+   ./mt-dgemm <N> <R> <alpha> <beta>
+
+..
+
 These values default to: :math:`N=256, R=8, \alpha=1.0, \beta=1.0`
 
 These inputs are subject to the conditions :math:`N>128, R>4`.
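As a sanity check on the DGEMM figure of merit, the GFLOP/s rate follows directly from the problem definition above. The sketch below is illustrative only (`dgemm_gflops` is a hypothetical helper, not part of mt-dgemm) and assumes the conventional 2·N³ operation count per GEMM; the lower-order 2·N² scale-and-accumulate terms are ignored.

```python
def dgemm_gflops(n, r, seconds):
    """Estimated GFLOP/s for R repetitions of C = alpha*A*B + beta*C.

    Assumes 2*N^3 floating-point operations per GEMM (N^3 multiplies
    plus N^3 additions), ignoring the lower-order 2*N^2 terms from
    the alpha/beta scaling.
    """
    flops = 2.0 * n ** 3 * r
    return flops / seconds / 1e9

# Example: N=1000, R=10 completing in 2 seconds -> 10 GFLOP/s.
print(dgemm_gflops(1000, 10, 2.0))
```

This is the usual way such benchmarks convert wall-clock time to a rate; an optimized BLAS that performs fewer operations internally would still be credited with the nominal 2·N³·R count.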
diff --git a/doc/sphinx/10_microbenchmarks/M3_DGEMM/cpu.gp b/doc/sphinx/10_microbenchmarks/M3_DGEMM/cpu.gp
index e4bb7155..14c4ac35 100644
--- a/doc/sphinx/10_microbenchmarks/M3_DGEMM/cpu.gp
+++ b/doc/sphinx/10_microbenchmarks/M3_DGEMM/cpu.gp
@@ -1,10 +1,10 @@
 #!/usr/bin/gnuplot
 set terminal pngcairo enhanced size 1024, 768 dashed font 'Helvetica,18'
-set output "cpu_66M.png"
+set output "dgemm_cts1.png"
 
-set title "Branson Strong Scaling Performance on CTS-1, 66M particles" font "serif,22"
+set title "Single node DGEMM" font "serif,22"
 set xlabel "No. Processing Elements"
-set ylabel "Figure of Merit (particles/sec)"
+set ylabel "Figure of Merit (GFLOP/s)"
 
 set xrange [1:64]
 set key left top
@@ -21,15 +21,7 @@ set key autotitle columnheader
 set style line 1 linetype 6 dashtype 1 linecolor rgb "#FF0000" linewidth 2 pointtype 6 pointsize 3
 set style line 2 linetype 1 dashtype 2 linecolor rgb "#FF0000" linewidth 2
 
-plot "cpu_66M.csv" using 1:2 with linespoints linestyle 1, "" using 1:3 with line linestyle 2
+# plot "cpu_66M.csv" using 1:2 with linespoints linestyle 1, "" using 1:3 with line linestyle 2
 
-set output "cpu_133M.png"
-set title "Branson Strong Scaling Performance on CTS-1, 133M particles" font "serif,22"
-plot "cpu_133M.csv" using 1:2 with linespoints linestyle 1, "" using 1:3 with line linestyle 2
-
-
-set output "cpu_200M.png"
-set title "Branson Strong Scaling Performance on CTS-1, 200M particles" font "serif,22"
-plot "cpu_200M.csv" using 1:2 with linespoints linestyle 1, "" using 1:3 with line linestyle 2
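The STREAM_ARRAY_SIZE rule of thumb used in the calculations in the STREAM section (three 8-byte arrays whose combined footprint is at least 4x the aggregate last-level cache) can be sketched as follows. This is an illustrative helper, not part of the benchmark; `stream_array_size` and its parameter defaults are assumptions for the example.

```python
def stream_array_size(cache_bytes_per_socket, sockets,
                      multiple=4, arrays=3, elem_bytes=8):
    """Minimum STREAM_ARRAY_SIZE (elements per array) such that the
    three double-precision arrays together occupy at least `multiple`
    times the aggregate last-level cache."""
    total_cache = cache_bytes_per_socket * sockets
    return multiple * total_cache // (arrays * elem_bytes)

# Broadwell example from the text: 2 sockets x 45 MB L3 -> 15000000 elements
print(stream_array_size(45_000_000, 2))
```

The result would then be passed on the compile line, e.g. `-DSTREAM_ARRAY_SIZE=15000000`, matching the Broadwell calculation shown in the STREAM section.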