Performance outline #635

charleskawczynski · 2022-07-13T19:41:12Z

Key items to tackle

Performance notes / specs

Desired config: 0.1 s wall clock per 400 s simulated time step.
Current state: 8 s wall clock per 400 s simulated time step.
This means performance optimizations and parallelism need to buy us an 80x speedup (8 s / 80 = 0.1 s).
Recompute the performance target based on CFL or the target time step at the target resolution. Looks like this would be another factor of 3.
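To make the arithmetic above easy to re-run when the numbers change, here is a minimal sketch. The function name `required_speedup` is hypothetical, and multiplying the 80x by the "factor of 3" is an assumption about how the two combine; the notes above do not say so explicitly.

```python
def required_speedup(current_s: float, target_s: float) -> float:
    """Ratio of current to desired wall-clock time per simulated time step."""
    return current_s / target_s


current_wall_s = 8.0   # current: 8 s wall clock per 400 s simulated step
target_wall_s = 0.1    # desired: 0.1 s wall clock per 400 s simulated step

base = required_speedup(current_wall_s, target_wall_s)

# Assumption: the CFL / target-resolution adjustment multiplies the requirement.
resolution_factor = 3.0
combined = base * resolution_factor

print(f"base speedup needed: ~{base:.0f}x")
print(f"with resolution factor: ~{combined:.0f}x")
```

Under that assumption the combined requirement comes out at roughly 240x, so the resolution adjustment is worth pinning down before setting milestones.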


821: make GC deterministic in distributed r=simonbyrne a=simonbyrne

# PULL REQUEST

## Purpose and Content
This should reduce MPI Waitall time by manually triggering the GC across all processes at the same time.

## Benefits and Risks
The number of steps will require some tuning to avoid out-of-memory errors.

## Linked Issues
- Item 3 of #635
- Mentioned in #686
- Supersedes #687

## PR Checklist
- [x] This PR has a corresponding issue OR is linked to an SDI.
- [x] I have followed CliMA's codebase [contribution](https://clima.github.io/ClimateMachine.jl/latest/Contributing/) and [style](https://clima.github.io/ClimateMachine.jl/latest/DevDocs/CodeStyle/) guidelines OR N/A.
- [x] I have followed CliMA's [documentation policy](https://github.com/CliMA/policies/wiki/Documentation-Policy).
- [x] I have checked all issues and PRs and I certify that this PR does not duplicate an open PR.
- [x] I linted my code on my local machine prior to submission OR N/A.
- [x] Unit tests are included OR N/A.
- [x] Code used in an integration test OR N/A.
- [x] All tests ran successfully on my local machine OR N/A.
- [x] All classes, modules, and function contain docstrings OR N/A.
- [x] Documentation has been added/updated OR N/A.

Co-authored-by: Simon Byrne <[email protected]>
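The idea in #821 (collect garbage at the same step on every process, so no rank stalls the others inside a Waitall) can be illustrated with a rough single-process analogue. This is not the PR's code, which targets Julia and MPI; `run_steps`, `gc_interval`, and `step_fn` are hypothetical names, and Python's `gc` module stands in for the runtime's collector. In an SPMD program every rank would run the same loop with the same interval, so collections line up by step count.

```python
import gc


def run_steps(n_steps, gc_interval, step_fn):
    """Run n_steps, collecting garbage only on steps divisible by gc_interval.

    Automatic collection is disabled so pauses happen at predictable steps;
    gc_interval must be tuned so memory does not run out between collections.
    """
    gc.disable()
    collected_at = []
    try:
        for step in range(1, n_steps + 1):
            step_fn(step)
            if step % gc_interval == 0:
                gc.collect()  # every rank would reach this on the same step
                collected_at.append(step)
    finally:
        gc.enable()  # restore automatic collection
    return collected_at


collected = run_steps(10, 4, lambda step: None)
print(collected)
```

The interval is the tuning knob the PR description warns about: too large and memory can be exhausted between collections, too small and the collection cost dominates.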

charleskawczynski · 2024-02-06T02:14:51Z

Superseded by #2632

charleskawczynski · 2024-02-07T01:11:43Z

I've excluded the item for time stepper LU factorization from further tracking in this issue because there are other, higher-level optimizations that we can apply (like parallelizing the function calls that the timestepper makes) that effectively nullify the benefit of reducing the frequency of LU factorizations. Reducing that frequency also has the side effect of trading off against the approximation that some physics operate more slowly than others.
I've also excluded the impact of loops over mesh components (iterator infrastructure) compared to a pre-computed mesh because I don't think this is relevant to GPU performance, which is what we're primarily targeting. @sriharshakandala can correct me if I'm wrong there.

charleskawczynski added Performance monitoring 🚀 🔍 Performance labels Jul 13, 2022

simonbyrne mentioned this issue Oct 2, 2022

make GC deterministic in distributed #821

Merged

10 tasks

charleskawczynski mentioned this issue Oct 11, 2023

Use rmul over rdiv CliMA/ClimaCore.jl#1496

Merged

charleskawczynski mentioned this issue Feb 6, 2024

Performance roadmap #2632

Open

charleskawczynski closed this as completed Feb 6, 2024

cmbengue mentioned this issue Apr 29, 2024

O1.2.6 (atmos) 1 SYPD for AMIP on a single A100 #2943

Open

cmbengue assigned charleskawczynski Apr 29, 2024
