Commit

The commit for STREAM and DGEMM docs was lost. Recovered it and recommitting.
dmageeLANL committed Sep 19, 2023
1 parent 6b066dd commit b38b6b5
Showing 6 changed files with 97 additions and 68 deletions.
54 changes: 27 additions & 27 deletions doc/sphinx/03_vibe/vibe.rst
@@ -124,8 +124,33 @@ Results from Parthenon are provided on the following systems:
* Commodity Technology System 1 (CTS-1) (Snow) with Intel Broadwell processors,
* An Nvidia A100 GPU hosted on an `Nvidia Arm HPC Developer Kit <https://developer.nvidia.com/arm-hpc-devkit>`_
CTS-1
--------
ATS-3 Rocinante HBM
-------------------
.. csv-table:: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes 40% Memory
:file: parthenon-ats5_spr-hbm128-intel-classic.csv
:align: center
:widths: 10, 10
:header-rows: 1
.. figure:: ats3_40.png
:align: center
:scale: 50%
:alt: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes
.. csv-table:: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes 60% Memory
:file: parthenon-ats5_spr-hbm160-intel-classic.csv
:align: center
:widths: 10, 10
:header-rows: 1
.. figure:: ats3_60.png
:align: center
:scale: 50%
:alt: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes
CTS-1 Snow
-----------
The mesh and meshblock size parameters are chosen to balance
realism/performance with memory footprint. For the following tests we
@@ -196,31 +221,6 @@ Throughput performance of Parthenon-VIBE on a 40GB A100 is provided within the f
:scale: 50%
:alt: VIBE Throughput Performance on A100
ATS-3
------
.. csv-table:: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes 40% Memory
:file: parthenon-ats5_spr-hbm128-intel-classic.csv
:align: center
:widths: 10, 10
:header-rows: 1
.. figure:: ats3_40.png
:align: center
:scale: 50%
:alt: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes
.. csv-table:: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes 60% Memory
:file: parthenon-ats5_spr-hbm160-intel-classic.csv
:align: center
:widths: 10, 10
:header-rows: 1
.. figure:: ats3_60.png
:align: center
:scale: 50%
:alt: VIBE Throughput Performance on ATS-3 Rocinante HBM nodes
Verification of Results
=======================
58 changes: 40 additions & 18 deletions doc/sphinx/10_microbenchmarks/M1_STREAM/STREAM.rst
@@ -42,49 +42,71 @@ The primary FOM is the Triad rate (MB/s).
Building
========

Adjustments to GOMP_CPU_AFFINITY may also be necessary.
Adjustments to GOMP_CPU_AFFINITY may be necessary.

You can modify the STREAM_ARRAY_SIZE value in the compilation step to change the array size used by the benchmark. Adjusting the array size can help accommodate the available memory on your system.
The STREAM_ARRAY_SIZE value is a critical parameter set at compile time and controls the size of the array used to measure bandwidth. STREAM requires different amounts of memory to run on different systems, depending on both the system cache size(s) and the granularity of the system timer.

You should adjust the value of 'STREAM_ARRAY_SIZE' (below) to meet BOTH of the following criteria:

1) Each array must be at least 4 times the size of the available cache memory. I don't worry about the difference between 10^6 and 2^20, so in practice the minimum array size is about 3.8 times the cache size.
(a) Example 1: for one Xeon E3 with 8 MB L3 cache, STREAM_ARRAY_SIZE should be >= 4 million, giving an array size of 30.5 MB and a total memory requirement of 91.5 MB.
(b) Example 2: for two Xeon E5's with 20 MB L3 cache each (using OpenMP), STREAM_ARRAY_SIZE should be >= 20 million, giving an array size of 153 MB and a total memory requirement of 458 MB.
2) The size should be large enough so that the 'timing calibration' output by the program is at least 20 clock-ticks.
For example, most versions of Windows have a 10 millisecond timer granularity. 20 "ticks" at 10 ms/tick is 200 milliseconds. If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec. This means each array must be at least 1 GB, or 128M elements.
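
The timer criterion can be checked with a few lines of integer arithmetic. The following is a minimal Python sketch, not part of STREAM itself. The function name ``min_elements_for_timer`` and its parameters are hypothetical, and it assumes (as the worked example implies) that the timed kernel touches two arrays, so each array holds half of the bytes moved. Sizes are treated as binary units (GiB, Mi elements).

```python
def min_elements_for_timer(tick_ms, bandwidth_bytes_per_s,
                           min_ticks=20, arrays_touched=2, bytes_per_element=8):
    """Smallest per-array element count that keeps one kernel above min_ticks.

    Assumption: the kernel reads one array and writes another, so the bytes
    moved during the timed interval are split across `arrays_touched` arrays.
    """
    # Bytes the memory system moves during min_ticks timer ticks.
    bytes_moved = bandwidth_bytes_per_s * min_ticks * tick_ms // 1000
    bytes_per_array = bytes_moved // arrays_touched
    return bytes_per_array // bytes_per_element


GIB = 2**30
# 10 ms timer granularity and a 10 GiB/s memory system, as in the example.
elements = min_elements_for_timer(tick_ms=10, bandwidth_bytes_per_s=10 * GIB)
print(elements, "elements =", elements // 2**20, "Mi elements")
```

With a finer-grained timer the required size shrinks proportionally, which is why the cache-size criterion usually dominates on modern systems.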

Set STREAM_ARRAY_SIZE using the -D flag on your compile line.

Example calculations for results presented here:

STREAM ARRAY SIZE CALCULATIONS:

ARRAY_SIZE ~= 4 x (45 MiB cache / processor) x (2 processors) / (3 arrays) / (8 bytes / element) = 15 Mi elements = 15000000

HASWELL: Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
CACHE: 40M
SOCKETS: 2
4 * ( 40M * 2 ) / 3 ARRAYS / 8 Bytes/element = 13.4 Mi elements = 13400000

BROADWELL: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
CACHE: 45M
SOCKETS: 2
4 * ( 45M * 2 ) / 3 ARRAYS / 8 BYTES/ELEMENT = 15.0 Mi elements = 15000000

SAPPHIRE RAPIDS: Intel(R) Xeon(R) Platinum 8480+
CACHE: 105M
SOCKETS: 2
4 * ( 105M * 2 ) / 3 ARRAYS / 8 BYTES/ELEMENT = 35.0 Mi elements = 35000000
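
The calculations above can be reproduced with a short Python sketch. The helper name ``stream_array_size`` is hypothetical. It implements the formula exactly as written in this section: four times the combined cache, split across three arrays of 8-byte elements. Note that this appears to size the three arrays together, which gives a smaller value than requiring each array individually to be four times the cache.

```python
def stream_array_size(cache_mib_per_socket, sockets,
                      arrays=3, bytes_per_element=8, factor=4):
    """Array length in Mi elements, following the formula used in this section."""
    total_cache_mib = cache_mib_per_socket * sockets
    return factor * total_cache_mib / arrays / bytes_per_element


# Cache sizes (MiB per socket) as listed above; all systems have two sockets.
systems = [("Haswell", 40), ("Broadwell", 45), ("Sapphire Rapids", 105)]
for name, cache_mib in systems:
    mi_elements = stream_array_size(cache_mib, sockets=2)
    print(f"{name}: {mi_elements:.1f} Mi elements")
```

The result is then rounded to a convenient decimal value and passed on the compile line with ``-DSTREAM_ARRAY_SIZE=<value>``.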

Running
=======

.. code-block:: bash
mpirun -np <num_processes> ./stream
srun -n <num_processes> ./stream
Replace `<num_processes>` with the number of MPI processes you want to use. For example, if you want to use 4 MPI processes, the command will be:

.. code-block:: bash
mpirun -np 4 ./stream
Input
-----

Dependent Variable(s)
---------------------

1. Maximum bandwidth while utilizing all hardware cores and threads. MAX_BW
2. A minimum number of cores and threads that achieves MAX_BW. MIN_CT
srun -n 4 ./stream
Example Results
===============

ATS-3 Rocinante HBM
-------------------

CTS-1 Snow
-----------

.. csv-table:: STREAM microbenchmark bandwidth measurement
:file: stream-cts1_ats5intel-oneapi-openmpi.csv
:file: stream_cts1.csv
:align: center
:widths: 10, 10
:widths: 10, 10, 10
:header-rows: 1

.. figure:: cpu_cts1.png
:align: center
:scale: 50%
:alt: STREAM microbenchmark bandwidth measurement

ATS-3 Rocinante HBM
-------------------

17 changes: 8 additions & 9 deletions doc/sphinx/10_microbenchmarks/M1_STREAM/cpu.gp
@@ -4,13 +4,14 @@ set terminal pngcairo enhanced size 1024, 768 dashed font 'Helvetica,18'
set output "cpu_cts1.png"

set title "STREAM Single node bandwidth" font "serif,22"
set xlabel "No. Processing Elements"
set ylabel "Figure of Merit Triad (MB/s)"
set ylabel "Per core triad (MB/s)"
set y2label "FOM: Total Triad (MB/s)"

set xrange [1:64]
set xrange [1:40]
set yrange [3000:15000]

set logscale x 2
set logscale y 2
# set logscale x 2
# set logscale y 2

set grid
show grid
@@ -21,9 +22,7 @@ set key autotitle columnheader
set style line 1 linetype 6 dashtype 1 linecolor rgb "#FF0000" linewidth 2 pointtype 6 pointsize 3
set style line 2 linetype 1 dashtype 2 linecolor rgb "#FF0000" linewidth 2

plot "stream-cts1_ats5intel-oneapi-openmpi.csv" using 1:2 with linespoints linestyle 1
plot "stream_cts1.csv" using 1:2 with linespoints linestyle 1 axis x1y1, "" using 1:3 with line linestyle 2 axis x1y2


# set output "cpu_133M.png"
# set title "Branson Strong Scaling Performance on CTS-1, 133M particles" font "serif,22"
# plot "cpu_133M.csv" using 1:2 with linespoints linestyle 1, "" using 1:3 with line linestyle 2

8 changes: 8 additions & 0 deletions doc/sphinx/10_microbenchmarks/M1_STREAM/stream_cts1.csv
@@ -0,0 +1,8 @@
No. Cores,Bandwidth (MB/s),Total Bandwidth (MB/s)
1,10690.1,10690.1
2,10701.3,21402.6
4,9316.5,37266.0
8,7884.5,63076.0
16,7747.5,123960.0
32,5510.3,176329.6
36,3189.2,114811.2
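
As a consistency check on these rows, the total bandwidth column should equal the per-core bandwidth multiplied by the core count. A minimal Python sketch follows, with the rows copied from ``stream_cts1.csv`` and all variable names chosen for illustration only:

```python
import csv
import io

CSV_TEXT = """No. Cores,Bandwidth (MB/s),Total Bandwidth (MB/s)
1,10690.1,10690.1
2,10701.3,21402.6
4,9316.5,37266.0
8,7884.5,63076.0
16,7747.5,123960.0
32,5510.3,176329.6
36,3189.2,114811.2
"""

reader = csv.reader(io.StringIO(CSV_TEXT))
next(reader)  # skip the header row
rows = [(int(cores), float(per_core), float(total))
        for cores, per_core, total in reader]

# Total bandwidth should be cores x per-core bandwidth (within rounding).
mismatches = [r for r in rows if abs(r[0] * r[1] - r[2]) > 0.05]

# Core count at which the total Triad bandwidth peaks.
peak_cores, _, peak_total = max(rows, key=lambda r: r[2])
print("mismatched rows:", mismatches)
print("peak total bandwidth:", peak_total, "MB/s at", peak_cores, "cores")
```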
12 changes: 10 additions & 2 deletions doc/sphinx/10_microbenchmarks/M3_DGEMM/DGEMM.rst
@@ -16,6 +16,7 @@ Problem
-------

.. math::
\mathbf{C} = \alpha*\mathbf{A}*\mathbf{B} + \beta*\mathbf{C}
Where :math:`A`, :math:`B`, and :math:`C` are square :math:`N \times N` matrices and :math:`\alpha` and :math:`\beta` are scalars. This operation is repeated :math:`R` times.
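
To make the operation concrete, here is a small pure-Python sketch of the update on tiny matrices. It is illustrative only: the benchmark itself calls an optimized BLAS routine, and the function name ``dgemm_reference`` is hypothetical rather than taken from the benchmark source.

```python
def dgemm_reference(alpha, a, b, beta, c, repeats=1):
    """Apply C = alpha*A*B + beta*C in place, `repeats` times, on N x N lists."""
    n = len(a)
    for _ in range(repeats):
        for i in range(n):
            # Row i of the new C depends only on row i of A, all of B,
            # and row i of the old C, so rows can be replaced one at a time.
            new_row = []
            for j in range(n):
                dot = sum(a[i][k] * b[k][j] for k in range(n))
                new_row.append(alpha * dot + beta * c[i][j])
            c[i] = new_row
    return c


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, so A*B equals A
c = [[1.0, 1.0], [1.0, 1.0]]
dgemm_reference(alpha=2.0, a=a, b=b, beta=0.5, c=c)
print(c)
```

Because :math:`C` appears on both sides, each repetition feeds the previous result back into the next update.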
@@ -30,7 +31,6 @@ GFLOP/s rate: <FOM> GF/s
Run Rules
---------


* Vendors are permitted to change the source code in the region marked in the source.
* Optimized BLAS/DGEMM routines are permitted (and encouraged) to demonstrate the highest performance.
* Vendors may modify the Makefile(s) as required
@@ -40,12 +40,14 @@ Building

Makefiles are provided for the Intel and GCC compilers. Before building, make sure the compiler and BLAS libraries are on PATH and LD_LIBRARY_PATH.

.. code-block::
.. code-block:: bash
cd src
patch -p1 < ../dgemm_omp_fixes.patch
make
..
If using a different compiler, copy and modify the simple makefiles to apply the appropriate flags.

If using a BLAS library other than MKL or OpenBLAS, modify the C source file to use the correct header and DGEMM call.
@@ -58,12 +60,18 @@ DGEMM uses OpenMP but does not use MPI.
Set the number of OpenMP threads before running.

.. code-block:: bash
export OPENBLAS_NUM_THREADS=<nthreads>
export OMP_NUM_THREADS=<nthreads>
..
.. code-block:: bash
./mt-dgemm <N> <R> <alpha> <beta>
..
These values default to: :math:`N=256, R=8, \alpha=1.0, \beta=1.0`

These inputs are subject to the conditions :math:`N>128, R>4`.
16 changes: 4 additions & 12 deletions doc/sphinx/10_microbenchmarks/M3_DGEMM/cpu.gp
@@ -1,10 +1,10 @@
#!/usr/bin/gnuplot
set terminal pngcairo enhanced size 1024, 768 dashed font 'Helvetica,18'
set output "cpu_66M.png"
set output "dgemm_cts1.png"

set title "Branson Strong Scaling Performance on CTS-1, 66M particles" font "serif,22"
set title "Single node DGEMM" font "serif,22"
set xlabel "No. Processing Elements"
set ylabel "Figure of Merit (particles/sec)"
set ylabel "Figure of Merit (GFlops)"

set xrange [1:64]
set key left top
@@ -21,15 +21,7 @@ set key autotitle columnheader
set style line 1 linetype 6 dashtype 1 linecolor rgb "#FF0000" linewidth 2 pointtype 6 pointsize 3
set style line 2 linetype 1 dashtype 2 linecolor rgb "#FF0000" linewidth 2

plot "cpu_66M.csv" using 1:2 with linespoints linestyle 1, "" using 1:3 with line linestyle 2
#plot "cpu_66M.csv" using 1:2 with linespoints linestyle 1, "" using 1:3 with line linestyle 2

set output "cpu_133M.png"
set title "Branson Strong Scaling Performance on CTS-1, 133M particles" font "serif,22"
plot "cpu_133M.csv" using 1:2 with linespoints linestyle 1, "" using 1:3 with line linestyle 2


set output "cpu_200M.png"
set title "Branson Strong Scaling Performance on CTS-1, 200M particles" font "serif,22"
plot "cpu_200M.csv" using 1:2 with linespoints linestyle 1, "" using 1:3 with line linestyle 2

