
Normalize UMT FOM by the number of iterations #103

Merged: 1 commit merged into main on Jul 11, 2024

Conversation

@pearce8 (Collaborator) commented Jul 2, 2024

  • FOM definition in the text
  • Update FOM data from Sierra (@aaroncblack Can you please update this?)
  • Update FOM data from Crossroads (@dmageeLANL Can you please update this?)

@gshipman (Collaborator) commented Jul 2, 2024

@pearce8 I hate to ask this, but does this change in FOM change the results here: https://lanl.github.io/benchmarks/06_umt/umt.html#example-fom-results

Thanks,

Galen

@dmageeLANL (Collaborator)

I didn't run this initially or have the output, so I don't know how many iterations there were to normalize it by.

@gshipman (Collaborator) commented Jul 3, 2024

@aaroncblack @pearce8 Can you please provide us with the UMT configs for Rocinante / Crossroads? I believe @aaroncblack or @richards12 ran this on Roci; @dmageeLANL did not run it.

@aaroncblack (Collaborator)

For Roci, I believe @richards12 used an Intel compiler build, most likely with "-O2" optimization and no other compiler tweaks. That is what I did on my local LLNL Intel platform.

In the LANL repo, under the UMT docs area, I see his graph used data points at 1, 8, 32, 56, 88, and 112 cores for both benchmark runs (the SPP1 and SPP2 problems).

You'll want to target half the node memory on these (128 GB per node on Roci? So target 64 GB of memory use). The problem size can be adjusted by changing the size of the mesh with "-B global -d x,y,z", where x,y,z is the number of mesh tiles in each axis dimension.

I tested locally at LLNL and found these sizes work best to get at or around 64 GB for the problem:

bash-4.4$ srun -n1 ./install/bin/test_driver -B global -d 14,14,14 -b 1
bash-4.4$ srun -n1 ./install/bin/test_driver -B global -d 31,31,31 -b 2

Change the '-n1' to 1, 8, 32, 56, 88, 112 for the runs.

Between each cycle, UMT will output a line like:
Teton driver: CPU MEM USE (rank 0): 581.305MB

If you multiply that by the number of ranks, you should get a rough estimate of total memory usage.
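
To tie this together, here is a minimal sketch of how the scaling runs and the memory check could be scripted (a hedged example; the log file names and anything beyond the srun/test_driver options shown above are assumptions, not something from this thread):

for n in 1 8 32 56 88 112; do
  for b in 1 2; do
    # Mesh tiling per the local testing above: -d 14,14,14 for -b 1, -d 31,31,31 for -b 2.
    if [ "$b" -eq 1 ]; then dims="14,14,14"; else dims="31,31,31"; fi
    srun -n "$n" ./install/bin/test_driver -B global -d "$dims" -b "$b" | tee "umt_b${b}_n${n}.log"
    # Rough total-memory estimate: per-rank MB reported by rank 0 times the number of ranks.
    mem=$(grep 'CPU MEM USE (rank 0)' "umt_b${b}_n${n}.log" | tail -1 | sed 's/.*: //; s/MB//')
    echo "b=$b n=$n approx total memory: $(echo "$mem * $n" | bc) MB"
  done
done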

@gshipman (Collaborator) commented Jul 3, 2024

@dmageeLANL Can you run as @aaroncblack describes above? Thx

@richards12 (Collaborator) commented Jul 8, 2024 via email

@gshipman (Collaborator) commented Jul 8, 2024

@richards12 It would be helpful to have your scripts so we can run UMT again the same way you ran it.
Do you need help getting onto Roci?

@dmageeLANL (Collaborator)

@aaroncblack Those instructions look reasonable; I'll give it a shot later today. I'll let you know if I run into any issues. @gshipman @pearce8 @richards12

@richards12 (Collaborator) commented Jul 8, 2024 via email

@dmageeLANL (Collaborator)

I got your package, Dave, but I don't really know what it means. I see there are a lot more packages in umt_workspace (metis, mfem, hypre). Does UMT require these? Also, I see that there are results there, which means there is a number of iterations. Does that mean we don't need to re-run it and the rest of this message is moot?


I've built UMT on Roci with Conduit using the default environment: PrgEnv-intel. I'm using the UMT in the benchmarks repo and the head of the develop branch of Conduit (0.9.2). The build went generally smoothly; I built both with CMake. But at runtime:

~ srun -N 1 -n 1 ./installs/bin/test_driver -B global -d 14,14,14 -b 1
Teton driver: number of MPI ranks: 1
Teton driver: Running predefined benchmark problem UMT SP#1
Teton driver: Threading enabled, max number of threads is 2
Teton driver: Rebuild with Conduit 0.8.9 or later to use tiled meshes.
srun: error: nid001109: task 0: Exited with exit code 1
srun: Terminating StepId=1412488.11

Which is weird, because it's Conduit 0.9.2. I tried setting export MPICH_SMP_SINGLE_COPY_MODE=CMA and MPICH_MAX_THREAD_SAFETY=multiple, but no dice. There's absolutely no information about the error.

@aaroncblack (Collaborator) commented Jul 9, 2024 via email

@gshipman (Collaborator) commented Jul 9, 2024

@dmageeLANL, you said you are using the develop branch. I know some release processes only embed a version number into the build for tagged releases. Maybe UMT is looking for a version number and can't find it because you have develop.

@dmageeLANL (Collaborator)

The head of develop is tagged as 0.9.2.

@richards12 (Collaborator) commented Jul 9, 2024 via email

@gshipman (Collaborator) commented Jul 9, 2024

@dmageeLANL you mentioned you are using the version of UMT in the GitHub.com/lanl/benchmarks repo? That is about 6 months old, I think:
https://github.com/LLNL/UMT/tree/ed70b58e77b6dfb29b6b7f01d53bde2a02b7f218
You need a relatively new checkout of UMT to get the FOM changes, I believe.
Here is where the message is coming from in that version; it isn't in newer versions of UMT:
https://github.com/LLNL/UMT/blob/ed70b58e77b6dfb29b6b7f01d53bde2a02b7f218/src/teton/driver/test_driver.cc#L1844

@gshipman (Collaborator) commented Jul 9, 2024

@dmageeLANL I verified that I can build and run on Roci using the latest UMT and Conduit.

(base) gshipman@nid001234:/usr/projects/eap/users/gshipman/benchmarks/UMT/install-ro/bin> srun -n1 ./test_driver -B global -d 31,31,31 -b 2
Teton driver: number of MPI ranks: 1
Teton driver: Running predefined benchmark problem UMT SP#2
Detected UMT run, fixing temperature iterations to one and increasing max flux iterations to enable convergence.
Teton driver: Using older GTA kernel, version 1.
Teton: setting verbosity to 1
=================================================================
=================================================================
Test driver starting time steps
=================================================================
Solving for 2928574464 global unknowns.
(5719872 spatial elements * 32 directions (angles) * 16 energy groups)
CPU memory needed per rank (average) for radiation intensity (PSI): 22343.2MB
Current CPU memory use (rank 0): 43555.1MB
Iteration control: relative tolerance set to 1e-07.
=================================================================

 
 >>>>>>>>>>>>>>>     End of Radiation Step Report    <<<<<<<<<<<<<<<
 TIME STEP        1  timerad =       0.0010000000  dtrad =   1.0000000000E-03
 
 FluxIters =            3
 TrMax =       0.0479810101 in Zone  238624 on Process     0
 TeMax =       0.5000000000 in Zone     686 on Process     0
 Energy deposited in material =    0.0000000000E+00 ERad total =    5.5683591379E-08 Energy check =  -4.1994305338E-20
 Recommended time step for next rad cycle =   5.0000000000E-04
 
 *****************     Run Time     *****************
                     Cycle (min)     Accumulated (min)
 RADTR          =     2.72014894         2.72014894
 Sweep(CPU)     =     2.42665883         2.42665883
 Sweep(GPU)     =     0.00000000         0.00000000
 Initialization =     0.27952584         0.27952584
 Finalization   =     0.00678847         0.00678847
  
 *****************   Convergence    *****************
     Controlled by =  Intensity 
     ProcessID     =       0
     Zone          =       1
     Rel Error     =  0.00000000000E+00
     Tr            =  3.13271659561E-02
     Te            =  5.00000000000E-01
     Rho           =  1.31000000000E+00
     Cv            =  5.01000000000E-01
     Source Rate   =  0.00000000000E+00
     Coordinates   =  2.4194E-03  2.4194E-03  1.6129E-02
  
 *****************  Time Step Vote  *****************
     For Cycle     =       2
     Controlled by =  Rad Energy Density
     ProcessID     =       0
     Control Zone  =  407680
     Recommend Dt  =  5.00000000000E-04
     Max Change    =  8.45899370175E-01
     Tr            =  3.13271659561E-02
     Tr Old        =  5.00000000000E-02
     Te            =  5.00000000000E-01
     Te Old        =  5.00000000000E-01
     Rho           =  1.31000000000E+00
     Cv            =  5.01000000000E-01
     Source Rate   =  0.00000000000E+00
     Coordinates   =  9.9758E-01  2.4194E-03  1.6129E-02
  
Teton driver: CPU MEM USE (rank 0): 44254.1MB

@dmageeLANL (Collaborator)

I got it running. Sorry for the confusion; I hadn't noticed that the version of UMT in this repository was older. I used the newest UMT and it worked!

@gshipman (Collaborator) commented Jul 9, 2024

Sweet! Once you have the performance numbers, please update the CSV files for the plots, tables, and such in the GitHub Pages documentation as well.

@dmageeLANL (Collaborator)

OK, I have results, but I'm not sure which number is the operative one. Here's the full results CSV (do the results look reasonable?):

Problem,nprocs,iterations,memory,wall_time,single_throughput,total_throughput
1,1,15,52276.3,581.864,1.25169e+08,4.17231e+07
2,1,15,48315.3,724.603,6.06244e+07,2.02081e+07
1,8,22,7473.68,100.527,1.06259e+09,2.41498e+08
2,8,24,7140.54,158.118,4.44515e+08,9.26073e+07
1,32,33,1937.47,49.6981,3.22405e+09,4.88492e+08
2,32,33,1625.43,57.3876,1.68404e+09,2.55157e+08
1,56,42,1020.18,38.5321,5.29242e+09,6.30051e+08
2,56,41,1045.1,48.617,2.46974e+09,3.01188e+08
1,88,49,760.32,42.1732,5.6414e+09,5.75653e+08
2,88,47,661.52,35.6454,3.86146e+09,4.10793e+08
1,112,46,530.523,28.5231,7.8305e+09,8.51141e+08
2,112,46,559.891,31.278,4.30701e+09,4.68153e+08

The numbers come from this part of the output; this is from procs=1, problem=1:

Teton driver: CPU MEM USE (rank 0): 52276.3MB

=================================================================
=================================================================
Test driver finished time steps
=================================================================
Average throughput of single iteration of iterative solver was 1.25169e+08 unknowns calculated per second.
Throughput of iterative solver was 4.17231e+07 unknowns calculated per second.
(average throughput of single iteration * # iterations for solver to produce answer)

Total number of flux solver iterations for run: 15
Total wall time for run: 581.864 seconds.
=================================================================
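
For reference, a minimal sketch of how these fields could be pulled out of each log into one CSV row (hedged; the log name, problem number, and rank count below are placeholders, and the grep/awk patterns assume the output format shown above):

log=umt_b1_n1.log; problem=1; nprocs=1
# Per-rank memory reported by rank 0, stripped of the "MB" suffix.
mem=$(grep 'CPU MEM USE (rank 0)' "$log" | tail -1 | sed 's/.*: //; s/MB//')
iters=$(grep 'Total number of flux solver iterations' "$log" | awk '{print $NF}')
wall=$(grep 'Total wall time for run' "$log" | awk '{print $6}')
single=$(grep 'Average throughput of single iteration' "$log" | awk '{print $10}')
total=$(grep 'Throughput of iterative solver was' "$log" | awk '{print $6}')
echo "$problem,$nprocs,$iters,$mem,$wall,$single,$total"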

I just want to make sure I'm looking at the right numbers and running this correctly before I make any changes.

Thanks!

@gshipman (Collaborator) left a comment

Done!

@gshipman marked this pull request as ready for review July 11, 2024 20:58
@gshipman merged commit 6b1fe16 into main on Jul 11, 2024