Add benchmark job with IO #884
Conversation
Thanks! Hourly diagnostics (the default for t_end: 12h) is too frequent for production runs, so this will be an upper limit on how long diagnostics take. Is there any reason why diagnostics take much more time for the GPU run than for the CPU run?
We also have these draft PRs in atmos: CliMA/ClimaAtmos.jl#2646 and CliMA/ClimaAtmos.jl#2852
Good point. I don't want us to keep dismissing IO, so I will decrease it to one output over the 12h. This is a factor of 60 from monthly means, and I can also reduce the number of output variables from 55 to 5, so that we are roughly in the same ballpark as a more realistic production run.
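A quick back-of-the-envelope check of the numbers above (illustrative arithmetic only, assuming a 30-day month for the monthly-mean comparison):

```python
# One output per 12 h vs. monthly means: a 30-day month contains
# 30 * 24 / 12 = 60 twelve-hour windows, hence the "factor of 60".
outputs_per_month = 30 * 24 // 12
print(outputs_per_month)  # 60

# Cutting the variable count from 55 to 5 reduces the volume per output
# by 11x, so the total data written stays within roughly the same order
# of magnitude as a monthly-mean production run.
relative_volume = outputs_per_month * 5 / 55
print(round(relative_volume, 1))  # 5.5
```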
The reason is that CPU runs spend much more time on compute, so if you add a little bit of IO, it won't change SYPD very much, because IO is not the dominant factor.
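The argument above can be sketched with made-up numbers (the costs below are hypothetical, not measured): SYPD is inversely proportional to wall time per simulated day, so a fixed IO cost shaves off a much larger fraction of SYPD when compute is fast (GPU) than when compute dominates (CPU).

```python
def sypd_retained(compute_s: float, io_s: float) -> float:
    """Fraction of the no-IO SYPD retained once a fixed IO cost is added.

    SYPD ~ 1 / (wall time per simulated day), so adding io_s seconds of
    IO per simulated day scales SYPD by compute_s / (compute_s + io_s).
    """
    return compute_s / (compute_s + io_s)

io = 2.0              # hypothetical seconds of IO per simulated day (same disk)
cpu_compute = 100.0   # CPU run: compute-dominated
gpu_compute = 5.0     # GPU run: compute much faster, so IO is relatively larger

print(f"CPU retains {sypd_retained(cpu_compute, io):.0%} of its SYPD")  # 98%
print(f"GPU retains {sypd_retained(gpu_compute, io):.0%} of its SYPD")  # 71%
```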
The bucket test was really minimal: essentially, it just checked that DSS was being applied. Now that we no longer apply DSS in Land, the test is no longer informative.
d1b5323 to 163a733
@kmdeck I removed the bucket test here
Can we run the CPU benchmarks on Caltech HPC? Currently we have:
We could, but it is much simpler to run on the same machine because it makes the comparison a little more meaningful (e.g., we are using the same disk). Typically this workflow is not running during working hours; sorry if I disrupted your work.
The benchmarks we've been running are not representative of real use cases because they don't save anything to disk (no diagnostics/checkpoints). As a result, it is impossible to estimate the cost of any real run (where we care about the output). This PR adds a job with default diagnostics/checkpoints that can be used as a reference for real-world usage.