
Speed up simulations to make milestone completion feasible. #98

Open
pscrozi opened this issue May 8, 2024 · 16 comments

@pscrozi
Collaborator

pscrozi commented May 8, 2024

Ilker is going to profile the simulation, and may make changes to Tioga to make it faster.

Nate: I fear we'd still be too slow if we don't get a massive speedup. The timestep size we're forced to run at with FSI is going to make us too slow. We need 500 - 600 s of simulation time, but it looks like we can only get to 14 s in a 12-hour run (the max allowed on Frontier right now).

Could we throw more nodes at it? Nalu-Wind is currently segfaulting on GPUs.
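To put the gap in numbers, here is a rough back-of-the-envelope sketch using only the figures quoted above (500 - 600 s target, 14 s per 12-hour job). The function names are made up for illustration, and it assumes throughput stays constant from job to job, which may not hold in practice.

```python
import math

# Figures quoted in the comment above.
TARGET_SIM_TIME_S = (500.0, 600.0)  # required simulated time (low, high)
SIM_TIME_PER_JOB_S = 14.0           # simulated seconds reached in one job
JOB_WALLCLOCK_H = 12.0              # max job length on Frontier

def jobs_needed(target_s, per_job_s=SIM_TIME_PER_JOB_S):
    """Number of max-length jobs needed at the current rate, rounded up."""
    return math.ceil(target_s / per_job_s)

def speedup_needed(target_s, per_job_s=SIM_TIME_PER_JOB_S):
    """Speedup factor needed to fit the whole target into a single job."""
    return target_s / per_job_s

for target in TARGET_SIM_TIME_S:
    n = jobs_needed(target)
    print(f"{target:.0f} s: {n} jobs ({n * JOB_WALLCLOCK_H:.0f} wall-clock hours), "
          f"or a {speedup_needed(target):.1f}x speedup for one job")
```

At the current rate that is roughly 36 to 43 chained 12-hour jobs, or a speedup of about 36x to 43x to finish in one job.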

@pscrozi
Collaborator Author

pscrozi commented May 9, 2024

Ilker tried to create a fresh ExaWind driver, but is hitting an error on Frontier (without Trilinos solvers). He will start profiling the code after he gets it running. He will talk to Jon Rood about getting the Frontier build working, hopefully tomorrow (5/10). Or he might ask Phil.

@psakievich
Collaborator

We need to pin the OpenFAST develop branch to a specific commit for our ExaWind work.

@pscrozi
Collaborator Author

pscrozi commented May 14, 2024

Ilker was able to build, but is out sick today (5/14).

Jon: Marc HdF and I will work on Tioga performance. Ilker can run and profile this and see what parts of Tioga are bad. Tioga might have MPI issues that need to be addressed so that it waits less or communicates better. Some base case issues were resolved, but with a bunch of Nalu-Wind instances, things get worse. It is certainly an MPI issue in Tioga. Ilker and I will continue to work on this. Making Tioga faster for this case would alleviate time bottlenecks in other cases as well.

@pscrozi
Collaborator Author

pscrozi commented May 15, 2024

Jon: still mostly focused on Tioga, with Ilker working on this. Ilker has a profile for this on Frontier, so we just need to look at it. For AMR-Wind, we would need help from LBNL. Nalu-Wind isn't the bottleneck here.

@psakievich
Collaborator

Ilker is back at work. He has run multiple cases and just needs to start profiling.

@pscrozi
Collaborator Author

pscrozi commented May 17, 2024

Ilker: looked at my results and found that I had debug flags on.

@pscrozi
Collaborator Author

pscrozi commented May 20, 2024

Ilker: I am having some exawind-manager trouble with a debug symbols build on Frontier, and I am dealing with that right now.

Jon: I will reach out.

@pscrozi
Collaborator Author

pscrozi commented May 29, 2024

Ilker: there are problems with compiling on Frontier. We can run simulations, but there are issues with debug flag outputs not working, and segfaults on the 16 turbine case.

Jon: spent a lot of time figuring out the right configuration to run on Frontier. There's an OpenFAST commit up in the air. We couldn't build with debug symbols with certain versions of rocm. Using rocm-6 seems to build with debug symbols. We're looking into perhaps a new version of HPC-toolkit. There is a segfault in OpenFAST, so we may have the wrong commit of OpenFAST. Tried OpenFAST-dev, which doesn't seem to work currently. Posted in the ExaWind channel, but should tag someone.

Going to try a problem without OpenFAST to get a simpler case to profile.

Ganesh and a post-doc are willing to help get this going, since it is a roadblock for him now.

@pscrozi
Collaborator Author

pscrozi commented Jun 3, 2024

Ganesh: just recompiled and ran, but hasn't done more debug cases yet.

Marc: spent time running the 16 turbine case with the 3 blade mesh on Frontier, looking for ways to speed it up. Found load imbalance. Got a 10% speedup, but looking for Nx speedups. Ilker is still working on this. Could be an issue with 3 meshes at the hub causing a lot of overset and MPI communication, leading to load imbalances.
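One common way to quantify the load imbalance mentioned above is the ratio of the slowest rank's time to the mean time across ranks. The sketch below is only an illustration: the function name and the timings are made up, and it is not taken from the Tioga or Nalu-Wind profiles discussed in this thread.

```python
from statistics import mean

def load_imbalance(times_per_rank):
    """Return max/mean of per-rank times; 1.0 means perfectly balanced.

    Since every rank waits for the slowest one at a synchronization point,
    this ratio is also an upper bound on the speedup from perfect rebalancing.
    """
    avg = mean(times_per_rank)
    if avg <= 0:
        raise ValueError("per-rank timings must have a positive mean")
    return max(times_per_rank) / avg

# Made-up per-rank timings (seconds per step), for illustration only.
example_times = [1.0, 1.0, 1.0, 5.0]
print(load_imbalance(example_times))
```

A ratio far above 1.0 in the overset connectivity phase would be consistent with a few ranks near the hub doing most of the search work, but that would have to be confirmed from the actual profile.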

Ganesh: has an idea to break apart each blade into 2 blocks to make TIOGA faster.

@pscrozi
Collaborator Author

pscrozi commented Jun 5, 2024

Ilker: Ganesh has a specific case that showed a slowdown. Visualized the mesh and saw that it was the issue. Discussed with Ganesh. There are 3 blocks per blade with a physical intersection, and TIOGA does a search at this intersection. Ganesh has an idea for working around this issue, expected to be complete by tomorrow.

@pscrozi
Collaborator Author

pscrozi commented Jun 10, 2024

Ganesh: still in progress on this.

Ilker: no updates on this one. Waiting for the OpenFAST issue to be resolved, and build to be complete on Frontier.

Nate: I've got a branch that I know works with FSI, and have it going on Flight at Sandia with exawind_manager and the classic Intel compiler. It starts and restarts with FSI and CFD with 2 turbines, so I know it is working. Now we just need to build on Frontier. It might be old enough that it doesn't include Derek's new changes. Ilker should try to build on Frontier, and there's a chance it might work. Got a sizeable speedup by changing AMR-Wind's blocking-factor=32 and max-grid-size=64, which achieved better load balancing and/or MPI performance overall.

@pscrozi
Collaborator Author

pscrozi commented Jun 12, 2024

Ilker: have built Nate's branch and the 16 turbine case. Now building with debug flags, and then will look at profiling data.

Ganesh: I split the mesh and am exporting it, and should be available here in a little bit.

Nate: change the AMR-Wind blocking factor and see if this speeds things up. Observe which part speeds up due to this change. There may be optimal blocking factors. The defaults showed a big slowdown. We should be able to see this through profiling.
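A small sketch of how such a sweep could be set up. It assumes the AMReX-style input keys `amr.blocking_factor` and `amr.max_grid_size`, and the usual AMReX constraints that the blocking factor is a power of two and that the max grid size is a multiple of it. The candidate values and helper names are hypothetical; only the pair 32/64 comes from this thread.

```python
from itertools import product

def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0

def candidate_pairs(blocking_factors, max_grid_sizes):
    """Yield (blocking_factor, max_grid_size) pairs that pass the assumed
    constraints: power-of-two blocking factor, max grid size a multiple of it."""
    for bf, mgs in product(blocking_factors, max_grid_sizes):
        if is_power_of_two(bf) and mgs % bf == 0:
            yield bf, mgs

def input_lines(bf, mgs):
    """Render one pair as input-file lines (key names are an assumption)."""
    return [f"amr.blocking_factor = {bf}", f"amr.max_grid_size = {mgs}"]

# Hypothetical sweep values; (32, 64) is the pair reported earlier in the thread.
for bf, mgs in candidate_pairs([8, 16, 32], [32, 64, 128]):
    print("; ".join(input_lines(bf, mgs)))
```

Each pair would then be run and profiled separately to see which phase of the timestep responds to the change.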

@pscrozi
Collaborator Author

pscrozi commented Jun 17, 2024

Ilker: tried profiling on 1024 nodes. Looked at case files. Once this runs, will have a better idea of what to look for. The mesh might not be set up correctly, but need to confirm with the creator of the mesh (unknown who made it, maybe Ashesh). It has 3 blades and a tower with some overlap.

Ganesh: done exporting mesh.

@pscrozi
Collaborator Author

pscrozi commented Jun 24, 2024

Ilker: seeing a segfault on 1024 nodes. Now running a smaller case. (Had been running the larger case.) No overset between blades and tower? May be an error here.

Ganesh: the mesh is on Kestrel, but I haven't used it yet. It reduces the volume needed to search by half (or more).

@pscrozi
Collaborator Author

pscrozi commented Jun 26, 2024

Nate: Lawrence posted timings with different numbers of nodes. Discussed yesterday.

Lawrence: I tested the production case on varying numbers of nodes. Submitted the milestone case on 384 nodes.

@pscrozi
Collaborator Author

pscrozi commented Aug 12, 2024

Ilker: nothing specific, but runs are in the queue. Blocking-factor may make a difference.

5 participants