Add benchmark job with IO #884

Merged: 2 commits merged into main from gb/benchmark_with_io on Jul 3, 2024
Conversation

@Sbozzolo Sbozzolo (Member) commented Jul 3, 2024

The benchmarks we've been running are not representative of real use cases because they don't save anything to disk (no diagnostics or checkpoints). As a result, it is impossible to estimate the cost of any real run (where we care about the output). This PR adds a job with default diagnostics/checkpoints that can be used as a reference for real-world usage.

┌──────────────────────────┬───────────────────────┬────────────────────────┬─────────────────────────┐
│            Build ID: 200 │ Horiz. res.: 30 elems │ CPU Run [64 processes] │       GPU Run [4 A100s] │
│                          │ Vert. res.: 63 levels │                        │                         │
│                          │           dt: 120secs │                        │                         │
├──────────────────────────┼───────────────────────┼────────────────────────┼─────────────────────────┤
│                          │               job ID: │          amip_diagedmf │       gpu_amip_diagedmf │
│                  Coupled │                 SYPD: │                 0.0378 │                  1.0704 │
│                          │          CPU max RSS: │              6.714 GiB │              10.208 GiB │
├──────────────────────────┼───────────────────────┼────────────────────────┼─────────────────────────┤
│                          │               job ID: │       amip_diagedmf_io │    gpu_amip_diagedmf_io │
│          Coupled with IO │                 SYPD: │                 0.0357 │                  0.5353 │
│                          │          CPU max RSS: │              6.699 GiB │              11.015 GiB │
├──────────────────────────┼───────────────────────┼────────────────────────┼─────────────────────────┤
│                          │               job ID: │    climaatmos_diagedmf │ gpu_climaatmos_diagedmf │
│    Atmos with diag. EDMF │                 SYPD: │                 0.0403 │                  1.0845 │
│                          │          CPU max RSS: │              6.737 GiB │               9.446 GiB │
├──────────────────────────┼───────────────────────┼────────────────────────┼─────────────────────────┤
│                          │               job ID: │             climaatmos │          gpu_climaatmos │
│ Atmos without diag. EDMF │                 SYPD: │                 0.1856 │                  4.4748 │
│                          │          CPU max RSS: │              6.071 GiB │               8.755 GiB │
└──────────────────────────┴───────────────────────┴────────────────────────┴─────────────────────────┘
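
As a quick illustration of how to read this table: SYPD is simulated years per wallclock day, so its reciprocal gives the wallclock days needed per simulated year. A minimal sketch (my own arithmetic, not part of the PR):

```python
# Back-of-the-envelope cost estimate from the benchmark table above.
# SYPD = simulated years per wallclock day, so 1 / SYPD is the number
# of wallclock days needed per simulated year.

def walltime_days(simulated_years: float, sypd: float) -> float:
    """Wallclock days needed to simulate `simulated_years` at a given SYPD."""
    return simulated_years / sypd

# "Coupled with IO" row from the table:
print(walltime_days(1.0, 0.0357))  # CPU, 64 processes: ~28 wallclock days per sim. year
print(walltime_days(1.0, 0.5353))  # GPU, 4 A100s: ~1.9 wallclock days per sim. year
```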

@Sbozzolo Sbozzolo requested a review from szy21 July 3, 2024 14:45
@szy21 szy21 (Member) left a comment
Thanks! Hourly diagnostics (the default for `t_end: 12h`) are too frequent for production runs, so this will be an upper limit on how long diagnostics take. Any reason why diagnostics take much more time for the GPU run than for the CPU run?

@szy21 szy21 (Member) commented Jul 3, 2024

We also have these draft PRs in atmos: CliMA/ClimaAtmos.jl#2646 and CliMA/ClimaAtmos.jl#2852

@Sbozzolo Sbozzolo (Member, Author) commented Jul 3, 2024

> Thanks! Hourly diagnostics (the default for `t_end: 12h`) are too frequent for production runs, so this will be an upper limit on how long diagnostics take.

Good point. I don't want us to keep dismissing IO, so I will decrease the frequency to one output over the 12 h. This is a factor of 60 more frequent than monthly means, and I can also reduce the number of variables we output from 55 to 5, so that we are roughly in the same ballpark as a more realistic production run.
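
A quick sanity check of these factors (my arithmetic, assuming a ~720 h month; not from the PR):

```python
# Sanity check of the factors above (assumes a ~720 h month for the
# monthly-mean comparison; my arithmetic, not from the PR).

frequency_factor = 720 / 12   # one output per 12 h is 60x more frequent than monthly
variable_factor = 55 / 5      # cutting the output fields from 55 to 5 is an 11x reduction
net_volume = frequency_factor / variable_factor
print(net_volume)             # ~5.5x the output volume of a monthly-mean production run
```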

> Any reason why diagnostics take much more time for the GPU run than for the CPU run?

The reason is that the CPU runs spend much more time computing, so adding a little bit of IO doesn't change the SYPD very much because IO is not the dominant cost there.
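
Using the SYPD numbers in the table, this can be made concrete under the simplifying assumption that wallclock cost per simulated year splits into a compute term plus the IO term added by the `_io` jobs (my decomposition, not from the PR):

```python
# Decompose wallclock cost per simulated year into compute + IO, using
# the SYPD numbers from the table (assumption: the IO cost per simulated
# year is what the "_io" job adds on top of the plain job).

def days_per_sim_year(sypd: float) -> float:
    return 1.0 / sypd

io_cost_cpu = days_per_sim_year(0.0357) - days_per_sim_year(0.0378)  # ~1.6 days/sim. year
io_cost_gpu = days_per_sim_year(0.5353) - days_per_sim_year(1.0704)  # ~0.9 days/sim. year

# The absolute IO cost is comparable on the two machines, but GPU compute
# is ~28x faster (1.0704 / 0.0378), so the same IO roughly doubles the GPU
# wallclock while shifting the CPU number by only a few percent.
print(io_cost_cpu, io_cost_gpu)
```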

Sbozzolo added 2 commits July 3, 2024 09:04
The bucket test was really minimal. Essentially, it just checked that
DSS was being applied. Now that we no longer apply DSS in Land, the
test is no longer informative.
@Sbozzolo Sbozzolo force-pushed the gb/benchmark_with_io branch from d1b5323 to 163a733 on July 3, 2024 16:04
@Sbozzolo Sbozzolo (Member, Author) commented Jul 3, 2024

@kmdeck I removed the bucket test here

@akshaysridhar akshaysridhar (Member) commented
Can we run the CPU benchmarks on Caltech HPC? Currently we have:

CPU ClimaAtmos with diagnostic EDMF
   NodeList=clima
   BatchHost=clima

@Sbozzolo
Copy link
Member Author

Sbozzolo commented Jul 3, 2024

> Can we run the CPU benchmarks on Caltech HPC? Currently we have:
>
> CPU ClimaAtmos with diagnostic EDMF
>    NodeList=clima
>    BatchHost=clima

We could, but it is much simpler to run everything on the same machine, and it makes the comparison a little more meaningful (e.g., we are using the same disk). Typically this workflow does not run during working hours; sorry if I disrupted your work.

@Sbozzolo Sbozzolo merged commit b4733e9 into main Jul 3, 2024
8 checks passed