Normalize UMT FOM by the number of iterations #103
pearce8 commented Jul 2, 2024 (edited by gshipman)
- FOM definition in the text
- Update FOM data from Sierra (@aaroncblack Can you please update this?)
- Update FOM data from Crossroads (@dmageeLANL Can you please update this?)
@pearce8 I hate to ask this, but does this change in FOM change the results here: https://lanl.github.io/benchmarks/06_umt/umt.html#example-fom-results Thanks, Galen |
I didn't run this initially or have the output so I don't know how many iterations there were to normalize it. |
@aaroncblack @pearce8 Can you please provide us with the UMT configs for Rocinante / Crossroads? I believe @aaroncblack or @richards12 ran this on Roci; @dmageeLANL did not run this. |
For roci, I believe @richards12 used an Intel compiler build with most likely "-O2" optimization and no other compiler tweaks. That is what I did on my local LLNL Intel platform. In the lanl repo under the umt docs area I see his graph used the data points at 1, 8, 32, 56, 88, and 112 cores for both benchmark runs (SPP1 and SPP2 problems).

You'll want to target half the node memory on these (128GB per node on roci? So target 64GB memory use). The problem size can be adjusted by changing the size of the mesh with "-B global -d x,y,z", where x, y, and z are the number of mesh tiles in each axis dimension. I tested locally at LLNL and found these numbers to work the best to get at/around 64GB for the problem.

bash-4.4$ srun -n1 ./install/bin/test_driver -B global -d 14,14,14 -b 1

Change the '1' in '-n1' to 1, 8, 32, 56, 88, or 112 for the runs. Between each cycle umt will output a line like:
If you multiply that by the # ranks you should get a rough estimate on total memory usage. |
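The sweep and the memory estimate described above can be sketched roughly as follows. The helper names (`sweep_commands`, `estimate_total_gb`) and the per-rank memory reading are hypothetical; only the `srun` command line, the rank counts, the 14,14,14 tiling, and the 64GB target come from the comment. Treating `-b` as the benchmark problem selector is also an assumption based on the `-b 1` run shown.

```python
# Rank counts and mesh tiling quoted in the comment above.
RANK_COUNTS = [1, 8, 32, 56, 88, 112]
TILES = (14, 14, 14)
TARGET_GB = 64.0  # half of the 128GB per node that the comment assumes for roci


def sweep_commands(rank_counts=RANK_COUNTS, tiles=TILES, benchmark=1):
    """Build one srun command string per rank count (nothing is executed).

    `benchmark` is passed to -b; that it selects the problem is an assumption.
    """
    dims = ",".join(str(t) for t in tiles)
    return [
        f"srun -n{n} ./install/bin/test_driver -B global -d {dims} -b {benchmark}"
        for n in rank_counts
    ]


def estimate_total_gb(per_rank_gb, ranks):
    """Rough total memory: the per-rank figure reported by UMT times the rank count."""
    return per_rank_gb * ranks


if __name__ == "__main__":
    for cmd in sweep_commands():
        print(cmd)
    # Hypothetical per-rank reading of 0.55GB on a 112-rank run.
    total = estimate_total_gb(0.55, 112)
    print(f"estimated total: {total:.1f}GB (target about {TARGET_GB:.0f}GB)")
```

If the estimate lands far from the target, the tile counts passed to `-d` are the knob to adjust.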
@dmageeLANL Can you run as @aaroncblack describes above? Thx |
@richards12 It would be helpful to have your scripts to run UMT again in the same way you ran it. |
@aaroncblack Those instructions look reasonable, I'll give it a shot later today. I'll let you know if I run into any issues. @gshipman @pearce8 @richards12 |
I got your package Dave. But I don't really know what it means. I see there are a lot more packages in umt_workspace (metis, mfem, hypre). Does UMT require these? Also, I see that there are results there, which means there's a number of iterations. Does this mean we don't need to rerun it and the rest of this message is moot?

I've built UMT on roci with conduit with the default environment: PrgEnv intel. I'm using the UMT in the benchmarks repo and the head of the develop branch of conduit (0.9.2). The build went generally smoothly; I built both with cmake. But at runtime:

~ srun -N 1 -n 1 ./installs/bin/test_driver -B global -d 14,14,14 -b 1
Teton driver: number of MPI ranks: 1
Teton driver: Running predefined benchmark problem UMT SP#1
Teton driver: Threading enabled, max number of threads is 2
Teton driver: Rebuild with Conduit 0.8.9 or later to use tiled meshes.
srun: error: nid001109: task 0: Exited with exit code 1
srun: Terminating StepId=1412488.11

Which is weird because it's conduit 0.9.2. I tried setting export MPICH_SMP_SINGLE_COPY_MODE=CMA, MPICH_MAX_THREAD_SAFETY=multiple but no dice. There's absolutely no information about the error. |
An older version of UMT required MFEM, but now we only need conduit.
(Sent by email in reply to Daniel J Magee's comment of Monday, July 8, 2024, 4:49 PM.)
|
@dmageeLANL, you said you are using the develop branch. I know some release processes only embed a version number into the build in tagged releases. Maybe UMT is looking for a version number and can't find it because you have develop. |
The head of develop is tagged as 0.9.2. |
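As a quick sanity check on the requirement quoted in the error message, a tuple comparison shows that 0.9.2 does satisfy "0.8.9 or later", so the message is unlikely to be about the installed Conduit release number as such. The `parse_version` helper is hypothetical and is not how UMT performs its own check.

```python
def parse_version(text):
    """Turn a dotted version string such as '0.9.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


REQUIRED = parse_version("0.8.9")   # minimum named in the error message
INSTALLED = parse_version("0.9.2")  # Conduit version reported in the thread

# Tuples compare element by element, so (0, 9, 2) >= (0, 8, 9) holds.
meets_requirement = INSTALLED >= REQUIRED
print(meets_requirement)
```

Comparing integer tuples rather than raw strings avoids mis-ordering versions such as 0.10.0 and 0.9.2.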
Daniel,
As Aaron mentioned, back when I did these runs UMT had more dependencies. Now that I think about it, that version of UMT also had a different input format and problem description. So I’m not sure how relevant any of the trials I did will be to the current version which has a significantly different problem definition.
Probably the most useful things you can find in the files are the input scripts, which should give you a sense of how I did the testing for different problem sizes. For each problem size (each R is a different problem size) I was running scaling across different numbers of MPI ranks. It looks like I also did multiple trials to check reproducibility of results.
Dave
…-------------------
David Richards
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
|
@dmageeLANL you mentioned you are using the version of UMT in the GitHub.com/lanl/benchmarks repo? This is 6 months old I think: |
I verified that I can build and run on Roci using latest UMT and Conduit.
|
I got it running. Sorry for the confusion, I hadn't noticed that the version of UMT in this repository was older. I used the newest UMT and it worked! |
Sweet! Once you have the performance numbers, please update the csv files for the plots and tables and such in the GitHub pages documentation as well. |
Ok I have results, but I'm not sure which number is the operative one. Here's the full result csv (do the results look reasonable?):
The numbers come from this part of the output (this is from procs=1, problem=1):
I just want to make sure I'm looking at the right numbers and running this correctly before I make any changes. Thanks! |
Done!