diff --git a/doc/sphinx/00_intro/introduction.rst b/doc/sphinx/00_intro/introduction.rst
index c4e3cb17..8b038bc4 100644
--- a/doc/sphinx/00_intro/introduction.rst
+++ b/doc/sphinx/00_intro/introduction.rst
@@ -180,6 +180,12 @@ Single node benchmarks will require respondent to provide estimates on
 
 * Problem size must be changed to meet % of memory requirements.
 
+* Respondent shall provide CPU strong scaling and GPU throughput results on current generation representative architectures.
+  If no representative architecture exists respondent can provide modeled / projected CPU strong scaling and GPU throughput results.
+  respondent may provide both results on current generation representative architectures and modeled / projected architectures.
+
+* For SSNI projections respondent shall use the specific problem size(s) specified for SSNI.
+
 Source code modification categories:
 
 * Baseline: “out-of-the-box” performance
@@ -232,6 +238,53 @@ Where:
 
 * w = weighting factor.
 
+
+.. _GlobalSSNIWeightsSizes:
+
+SSNI Weights and SSNI problem sizes
+===================================
+
+
+.. list-table::
+
+   * - **SSNI Benchmark**
+     - **SSNI Weight**
+     - **SSNI Problem size - % device memory**
+   * - Branson
+     - TBD
+     - 30
+   * - AMG2023 Problem 1 Setup
+     - TBD
+     - 20
+   * - AMG2023 Problem 2 Setup
+     - TBD
+     - 20
+   * - AMG2023 Problem 1 Solve
+     - TBD
+     - 20
+   * - AMG2023 Problem 2 Solve
+     - TBD
+     - 20
+   * - MiniEM
+     - TBD
+     - TBD
+   * - MLMD Training
+     - TBD
+     - N/A
+   * - MLMD Simulation
+     - TBD
+     - 60
+   * - Parthenon-VIBE
+     - TBD
+     - 40
+   * - Sparta
+     - TBD
+     - TBD
+   * - UMT
+     - TBD
+     - TBD
+
+
 System Information
 ==================
 
diff --git a/doc/sphinx/01_branson/branson.rst b/doc/sphinx/01_branson/branson.rst
index f7cbe032..286ae590 100644
--- a/doc/sphinx/01_branson/branson.rst
+++ b/doc/sphinx/01_branson/branson.rst
@@ -28,9 +28,25 @@ It is in replicated mode which means there is very little MPI communication (end
 Figure of Merit
 ---------------
 
-The Figure of Merit is defined as particles/second and is obtained by dividing the number of particles in the problem divided by the `Total transport` value in the output. Future versions will output this number directly.
+The Figure of Merit is defined as particles/second and is obtained by dividing the number of particles in the problem divided by the `Total transport` value.
+This value is labeled "Photons Per Second (FOM):" in Branson's output.
 
+Problem Sizes
+-------------
 
+For strong scaling on a CPU, Branson must be run with three different problem sizes such that the memory
+footprint of all Branson processes at the smallest process count per node is approximately: 4 to 5%, 8 to 10%, and 20 to 22%; during step 2 of the simulation.
+
+
+For throughput curves on a GPU the memory footprint of Branson must vary between ~5% and ~80% in increments of at most 5% of the computational device's main memory.
+
+The memory footprint can be controlled by editing "photons" in the input file.
+
+Results of both CPU strong scaling and GPU throughput should be provided on a representative, current-generation hardware configuration used in benchmarking and projections.
+Results which are
+
+See (see :ref:`GlobalSSNIWeightsSizes`) for the problem size for SSNI projection.
+
 Building
 ========
 
@@ -104,8 +120,7 @@ It is run with:
 
 ..
 
-For strong scaling on a CPU, Branson should be run with three different problem sizes such that the memory
-footprint at the smallest process count per node is approximately: 4 to 5%, 8 to 10%, and 20 to 22%; during step 2 of the simulation.
+
 
 Memory footprint is the sum of all Branson processes resident set size (or equivalent) on the node.
 This can be obtained on a CPU system using the following (while the application is in step 2):
@@ -116,8 +131,7 @@ This can be obtained on a CPU system using the following (while the application
    ps -C BRANSON -o rss | awk '{sum+=$1;} END{print sum/1024/1024;}'
 
 ..
-For throughput curves on a GPU the memory footprint of Branson must vary between ~5% and ~60% in increments of at most 5% of the computational device's main memory.
-The memory footprint can be controlled by editing "photons" in the input file.
+
 
 Results from Branson are provided on the following systems:
 
@@ -128,17 +142,19 @@ Results from Branson are provided on the following systems:
 
 .. _DarwinA100:
 
+AMD Epyc + Nvidia A100
+----------------------
+
 Dual socket AMD Epyc 7502 with 32 cores operating at 2.5 GHz with 256 GBytes CPU memory and dual Nvidia Ampere A100-SXM4 GPUs with 40GBytes of memory per GPU.
 
-
 Correctness
 ------------
 
 Branson has two main checks on correctness. The first is a looser check that's meant as a "smoke test" to see if a code change has introduced an error. After every timestep, a summary block is
-printed sdlfdjskl:
+printed:
 
 .. code-block:: bash
 
@@ -181,6 +197,7 @@ The second check on correctness is much simpler. For any changes to Branson, the
 the same temperature in a standard marshak wave problem after 100 cycles. For the `marshak wave input `_ file, the following temperature profile should be reproduced to 3% after 100 cycles, as shown below:
 
 .. code-block:: bash
+
    Step: 100 Start Time: 0.99 End Time: 1 dt: 0.01
    source time: 0.094371
    -------- VERBOSE PRINT BLOCK: CELL TEMPERATURE --------
@@ -211,7 +228,7 @@ the same temperature in a standard marshak wave problem after 100 cycles. For th
    23 0.010000237 0.0099765577 2.3568109e-07
    24 0.010000281 0.0099765314 2.3568212e-07
    -------------------------------------------------------
-
+..
 
 
 This output is expected as long as the spatial, boundary and region blocks are kept the same in the
@@ -256,8 +273,7 @@ figure.
 
    Branson Strong Scaling Performance on Crossroads 66M particles
 
-Strong scaling performance of Branson Crossroads 200M Particles is provided within the following table and
-figure.
+Strong scaling performance of Branson Crossroads 200M Particles is provided within the following table and figure.
 
 .. csv-table:: Branson Strong Scaling Performance on Crossroads 200M particles
    :file: cpu_200M.csv
@@ -272,24 +288,24 @@ figure.
 
    Branson Strong Scaling Performance on Crossroads 200M particles
 
-AMD Epyc + Nvidia A100
-------------
+AMD Epyc + Nvidia A100
+----------------------
 
 Throughput performance of Branson on AMD Epyc + Nvidia A100 (using a single GPU) is provided within the following table and
 figure.
 
-.. csv-table::Branson Throughput Performance on AMD Epyc + A100
+.. csv-table:: Branson Throughput Performance on AMD Epyc + Nvidia A100
    :file: gpu.csv
    :align: center
-   :widths: 10, 10
+   :widths: 15, 15
    :header-rows: 1
 
 .. figure:: gpu.png
    :align: center
    :scale: 50%
-   :alt: Branson Throughput Performance on AMD Epyc + A100
+   :alt: Branson Throughput Performance on AMD Epyc + Nvidia A100
 
-   Branson Throughput Performance on AMD Epyc + A100
+   Branson Throughput Performance on AMD Epyc + Nvidia A100
 
 References
 ==========
diff --git a/doc/sphinx/01_branson/gpu.csv b/doc/sphinx/01_branson/gpu.csv
index ff4a1eb4..aef034a9 100644
--- a/doc/sphinx/01_branson/gpu.csv
+++ b/doc/sphinx/01_branson/gpu.csv
@@ -1,22 +1,22 @@
-No. Particles,Actual
-100000,2.33E+05
-200000,4.32E+05
-300000,5.55E+05
-400000,6.52E+05
-500000,7.14E+05
-600000,7.84E+05
-700000,8.17E+05
-800000,8.40E+05
-900000,8.81E+05
-1000000,9.06E+05
-2000000,9.51E+05
-3000000,8.72E+05
-4000000,8.38E+05
-5000000,7.92E+05
-6600000,7.39E+05
-10000000,6.34E+05
-13300000,5.76E+05
-20000000,5.03E+05
-50000000,3.54E+05
-100000000,2.74E+05
-200000000,2.23E+05
+No. Particles, Actual
+100000, 2.33E+05
+200000, 4.32E+05
+300000, 5.55E+05
+400000, 6.52E+05
+500000, 7.14E+05
+600000, 7.84E+05
+700000, 8.17E+05
+800000, 8.40E+05
+900000, 8.81E+05
+1000000, 9.06E+05
+2000000, 9.51E+05
+3000000, 8.72E+05
+4000000, 8.38E+05
+5000000, 7.92E+05
+6600000, 7.39E+05
+10000000, 6.34E+05
+13300000, 5.76E+05
+20000000, 5.03E+05
+50000000, 3.54E+05
+100000000, 2.74E+05
+200000000, 2.23E+05
diff --git a/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst b/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst
index 1af00b4d..4a44ee8d 100644
--- a/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst
+++ b/doc/sphinx/09_Microbenchmarks/M1_STREAM/STREAM.rst
@@ -98,6 +98,7 @@ At capacity, the measured values should reach a steady state where increasing th
 For Crossroads, the benchmark was build with ``STREAM_ARRAY_SIZE=40000000`` and ``NTIMES=20`` with optmizations and OpenMP enabled.
 
 .. code-block:: bash
+
    make CC=`which mpicc` FF=`which mpifort` CFLAGS="-O2 -fopenmp -DSTREAM_ARRAY_SIZE=40000000 -DNTIMES=20" FFLAGS="-O2 -fopenmp -DSTREAM_ARRAY_SIZE=40000000 -DNTIMES=20"