
Normalize UMT FOM by the number of iterations #103

Merged: 1 commit merged into main on Jul 11, 2024

Conversation

@pearce8 (Collaborator) commented Jul 2, 2024

  • FOM definition in the text
  • Update FOM data from Sierra (@aaroncblack Can you please update this?)
  • Update FOM data from Crossroads (@dmageeLANL Can you please update this?)

@gshipman (Collaborator) commented Jul 2, 2024

@pearce8 I hate to ask this, but does this change in FOM change the results here: https://lanl.github.io/benchmarks/06_umt/umt.html#example-fom-results

Thanks,

Galen

@dmageeLANL (Collaborator)

I didn't run this initially or have the output, so I don't know how many iterations there were to normalize it by.

@gshipman (Collaborator) commented Jul 3, 2024

@aaroncblack @pearce8 Can you please provide us with the UMT configs for Rocinante / Crossroads? I believe @aaroncblack or @richards12 ran this on Roci; @dmageeLANL did not run it.

@aaroncblack (Collaborator)

For Roci, I believe @richards12 used an Intel compiler build, most likely with "-O2" optimization and no other compiler tweaks. That is what I did on my local LLNL Intel platform.

In the LANL repo, under the UMT docs area, I see his graph used data points at 1, 8, 32, 56, 88, and 112 cores for both benchmark runs (the SPP1 and SPP2 problems).

You'll want to target half the node memory on these (128 GB per node on Roci? So target 64 GB of memory use). The problem size can be adjusted by changing the size of the mesh with "-B global -d x,y,z", where x,y,z is the number of mesh tiles in each axis dimension.

I tested locally at LLNL and found these sizes work best to get at or around 64 GB for the problem:

bash-4.4$ srun -n1 ./install/bin/test_driver -B global -d 14,14,14 -b 1
bash-4.4$ srun -n1 ./install/bin/test_driver -B global -d 31,31,31 -b 2

Change the '-n1' to 1, 8, 32, 56, 88, 112 for the runs.

Between each cycle, UMT will output a line like:
Teton driver: CPU MEM USE (rank 0): 581.305MB

If you multiply that by the number of ranks, you should get a rough estimate of total memory usage.
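
To tie this together, here is a minimal sketch of how the scaling runs and the memory check could be scripted (a hedged example; the log file names and anything beyond the srun/test_driver options shown above are assumptions, not something from this thread):

for n in 1 8 32 56 88 112; do
  for b in 1 2; do
    # Mesh tiling per the local testing above: -d 14,14,14 for -b 1, -d 31,31,31 for -b 2.
    if [ "$b" -eq 1 ]; then dims="14,14,14"; else dims="31,31,31"; fi
    srun -n "$n" ./install/bin/test_driver -B global -d "$dims" -b "$b" | tee "umt_b${b}_n${n}.log"
    # Rough total-memory estimate: per-rank MB reported by rank 0 times the number of ranks.
    mem=$(grep 'CPU MEM USE (rank 0)' "umt_b${b}_n${n}.log" | tail -1 | sed 's/.*: //; s/MB//')
    echo "b=$b n=$n approx total memory: $(echo "$mem * $n" | bc) MB"
  done
done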

@gshipman (Collaborator) commented Jul 3, 2024

@dmageeLANL Can you run as @aaroncblack describes above? Thx

@richards12 (Collaborator) commented Jul 8, 2024 via email

@gshipman (Collaborator) commented Jul 8, 2024

@richards12 It would be helpful to have your scripts so we can run UMT again the same way you ran it.
Do you need help getting onto Roci?

@dmageeLANL (Collaborator)

@aaroncblack Those instructions look reasonable; I'll give it a shot later today. I'll let you know if I run into any issues. @gshipman @pearce8 @richards12

@richards12 (Collaborator) commented Jul 8, 2024 via email

@dmageeLANL (Collaborator)

I got your package, Dave, but I don't really know what it means. I see there are a lot more packages in umt_workspace (metis, mfem, hypre). Does UMT require these? Also, I see that there are results there, which means there is a number of iterations. Does that mean we don't need to re-run it and the rest of this message is moot?


I've built UMT on Roci with Conduit using the default environment: PrgEnv-intel. I'm using the UMT in the benchmarks repo and the head of the develop branch of Conduit (0.9.2). The build went generally smoothly; I built both with CMake. But at runtime:

~ srun -N 1 -n 1 ./installs/bin/test_driver -B global -d 14,14,14 -b 1
Teton driver: number of MPI ranks: 1
Teton driver: Running predefined benchmark problem UMT SP#1
Teton driver: Threading enabled, max number of threads is 2
Teton driver: Rebuild with Conduit 0.8.9 or later to use tiled meshes.
srun: error: nid001109: task 0: Exited with exit code 1
srun: Terminating StepId=1412488.11

Which is weird, because it's Conduit 0.9.2. I tried setting export MPICH_SMP_SINGLE_COPY_MODE=CMA and MPICH_MAX_THREAD_SAFETY=multiple, but no dice. There's absolutely no information about the error.

@aaroncblack (Collaborator) commented Jul 9, 2024 via email

@gshipman (Collaborator) commented Jul 9, 2024

@dmageeLANL, you said you are using the develop branch. I know some release processes only embed a version number into the build for tagged releases. Maybe UMT is looking for a version number and can't find it because you have develop.

@dmageeLANL (Collaborator)

The head of develop is tagged as 0.9.2.

@richards12 (Collaborator) commented Jul 9, 2024 via email

@gshipman (Collaborator) commented Jul 9, 2024

@dmageeLANL you mentioned you are using the version of UMT in the GitHub.com/lanl/benchmarks repo? That is about 6 months old, I think:
https://github.com/LLNL/UMT/tree/ed70b58e77b6dfb29b6b7f01d53bde2a02b7f218
You need a relatively new checkout of UMT to get the FOM changes, I believe.
Here is where the message is coming from in that version; it isn't in newer versions of UMT:
https://github.com/LLNL/UMT/blob/ed70b58e77b6dfb29b6b7f01d53bde2a02b7f218/src/teton/driver/test_driver.cc#L1844

@gshipman (Collaborator) commented Jul 9, 2024

@dmageeLANL I verified that I can build and run on Roci using the latest UMT and Conduit.

(base) gshipman@nid001234:/usr/projects/eap/users/gshipman/benchmarks/UMT/install-ro/bin> srun -n1 ./test_driver -B global -d 31,31,31 -b 2
Teton driver: number of MPI ranks: 1
Teton driver: Running predefined benchmark problem UMT SP#2
Detected UMT run, fixing temperature iterations to one and increasing max flux iterations to enable convergence.
Teton driver: Using older GTA kernel, version 1.
Teton: setting verbosity to 1
=================================================================
=================================================================
Test driver starting time steps
=================================================================
Solving for 2928574464 global unknowns.
(5719872 spatial elements * 32 directions (angles) * 16 energy groups)
CPU memory needed per rank (average) for radiation intensity (PSI): 22343.2MB
Current CPU memory use (rank 0): 43555.1MB
Iteration control: relative tolerance set to 1e-07.
=================================================================

 
 >>>>>>>>>>>>>>>     End of Radiation Step Report    <<<<<<<<<<<<<<<
 TIME STEP        1  timerad =       0.0010000000  dtrad =   1.0000000000E-03
 
 FluxIters =            3
 TrMax =       0.0479810101 in Zone  238624 on Process     0
 TeMax =       0.5000000000 in Zone     686 on Process     0
 Energy deposited in material =    0.0000000000E+00 ERad total =    5.5683591379E-08 Energy check =  -4.1994305338E-20
 Recommended time step for next rad cycle =   5.0000000000E-04
 
 *****************     Run Time     *****************
                     Cycle (min)     Accumulated (min)
 RADTR          =     2.72014894         2.72014894
 Sweep(CPU)     =     2.42665883         2.42665883
 Sweep(GPU)     =     0.00000000         0.00000000
 Initialization =     0.27952584         0.27952584
 Finalization   =     0.00678847         0.00678847
  
 *****************   Convergence    *****************
     Controlled by =  Intensity 
     ProcessID     =       0
     Zone          =       1
     Rel Error     =  0.00000000000E+00
     Tr            =  3.13271659561E-02
     Te            =  5.00000000000E-01
     Rho           =  1.31000000000E+00
     Cv            =  5.01000000000E-01
     Source Rate   =  0.00000000000E+00
     Coordinates   =  2.4194E-03  2.4194E-03  1.6129E-02
  
 *****************  Time Step Vote  *****************
     For Cycle     =       2
     Controlled by =  Rad Energy Density
     ProcessID     =       0
     Control Zone  =  407680
     Recommend Dt  =  5.00000000000E-04
     Max Change    =  8.45899370175E-01
     Tr            =  3.13271659561E-02
     Tr Old        =  5.00000000000E-02
     Te            =  5.00000000000E-01
     Te Old        =  5.00000000000E-01
     Rho           =  1.31000000000E+00
     Cv            =  5.01000000000E-01
     Source Rate   =  0.00000000000E+00
     Coordinates   =  9.9758E-01  2.4194E-03  1.6129E-02
  
Teton driver: CPU MEM USE (rank 0): 44254.1MB

@dmageeLANL (Collaborator)

I got it running. Sorry for the confusion; I hadn't noticed that the version of UMT in this repository was older. I used the newest UMT and it worked!

@gshipman (Collaborator) commented Jul 9, 2024

Sweet! Once you have the performance numbers, please update the CSV files for the plots, tables, and such in the GitHub Pages documentation as well.

@dmageeLANL (Collaborator)

OK, I have results, but I'm not sure which number is the operative one. Here's the full results CSV (do the results look reasonable?):

Problem,nprocs,iterations,memory,wall_time,single_throughput,total_throughput
1,1,15,52276.3,581.864,1.25169e+08,4.17231e+07
2,1,15,48315.3,724.603,6.06244e+07,2.02081e+07
1,8,22,7473.68,100.527,1.06259e+09,2.41498e+08
2,8,24,7140.54,158.118,4.44515e+08,9.26073e+07
1,32,33,1937.47,49.6981,3.22405e+09,4.88492e+08
2,32,33,1625.43,57.3876,1.68404e+09,2.55157e+08
1,56,42,1020.18,38.5321,5.29242e+09,6.30051e+08
2,56,41,1045.1,48.617,2.46974e+09,3.01188e+08
1,88,49,760.32,42.1732,5.6414e+09,5.75653e+08
2,88,47,661.52,35.6454,3.86146e+09,4.10793e+08
1,112,46,530.523,28.5231,7.8305e+09,8.51141e+08
2,112,46,559.891,31.278,4.30701e+09,4.68153e+08

The numbers come from this part of the output; this is from procs=1, problem=1:

Teton driver: CPU MEM USE (rank 0): 52276.3MB

=================================================================
=================================================================
Test driver finished time steps
=================================================================
Average throughput of single iteration of iterative solver was 1.25169e+08 unknowns calculated per second.
Throughput of iterative solver was 4.17231e+07 unknowns calculated per second.
(average throughput of single iteration * # iterations for solver to produce answer)

Total number of flux solver iterations for run: 15
Total wall time for run: 581.864 seconds.
=================================================================
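
For reference, a minimal sketch of how these fields could be pulled out of each log into one CSV row (hedged; the log name, problem number, and rank count below are placeholders, and the grep/awk patterns assume the output format shown above):

log=umt_b1_n1.log; problem=1; nprocs=1
# Per-rank memory reported by rank 0, stripped of the "MB" suffix.
mem=$(grep 'CPU MEM USE (rank 0)' "$log" | tail -1 | sed 's/.*: //; s/MB//')
iters=$(grep 'Total number of flux solver iterations' "$log" | awk '{print $NF}')
wall=$(grep 'Total wall time for run' "$log" | awk '{print $6}')
single=$(grep 'Average throughput of single iteration' "$log" | awk '{print $10}')
total=$(grep 'Throughput of iterative solver was' "$log" | awk '{print $6}')
echo "$problem,$nprocs,$iters,$mem,$wall,$single,$total"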

I just want to make sure I'm looking at the right numbers and running this correctly before I make any changes.

Thanks!

@gshipman (Collaborator) left a comment

Done!

@gshipman marked this pull request as ready for review July 11, 2024 20:58
@gshipman merged commit 6b1fe16 into main on Jul 11, 2024