Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically detect restart files and add tests for restarts #3307

Merged
merged 6 commits into from
Sep 27, 2024

Conversation

Sbozzolo
Copy link
Member

@Sbozzolo Sbozzolo commented Sep 16, 2024

This commit adds a new option, detect_restart_file. When detect_restart_file is true, ClimaAtmos looks at the previous output_XXXX output folder and tries to see if there's a restart file in it. If yes, it uses it. This behavior is overridden when a restart file is explicitely passed. This commit uses and re-implments the same logic implemented in ClimaUtilities.OutputPathGenerator. (We might want to move that logic to ClimaUtilities at some point.)

This PR also adds a test to ensure that restarted simulations are identical to non-restarted ones.

This is accomplished by running a simulation for three steps and comparing it to two other cases:

  1. A simulation restarted from the checkpoint produced at step 3
  2. A simulation restarted from the checkpoint produced at step 2 and solved

This is a variation of #3031
Closes #3031

This is the first step in improving compatibility with restarts, and more work is needed. There is a little bit of commented out code to loop over different configurations. I am working on it now, but I would like to start getting this test merged in, as I add more complexity and fix things.

GPU tests are commented out because of ClimaCore bug,
MPI tests are commented out because of process hanging.

I will work on this next

@Sbozzolo Sbozzolo force-pushed the gb/automatic_restart branch 10 times, most recently from 933d5b2 to dfd67b7 Compare September 17, 2024 17:44
@Sbozzolo Sbozzolo changed the title Automatically detect restart files Automatically detect restart files and add tests for restarts Sep 19, 2024
@Sbozzolo Sbozzolo force-pushed the gb/automatic_restart branch 6 times, most recently from 8a324d8 to 85b525b Compare September 19, 2024 17:10
@Sbozzolo Sbozzolo force-pushed the gb/automatic_restart branch 3 times, most recently from a63e048 to f11c3b6 Compare September 19, 2024 17:48
@Sbozzolo Sbozzolo force-pushed the gb/automatic_restart branch from f11c3b6 to cde3ec3 Compare September 19, 2024 18:11
@Sbozzolo Sbozzolo force-pushed the gb/automatic_restart branch from cf70fba to 0a7ccd7 Compare September 23, 2024 23:01
Copy link
Member

@charleskawczynski charleskawczynski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of the source code, I'm fine with most of the changes. But I think we can improve a few things:

  • get_restart_file
  • compare
  • format of test/restart.jl

I'll comment locally to go into details.

test/restart.jl Outdated Show resolved Hide resolved
test/restart.jl Outdated Show resolved Hide resolved
@Sbozzolo Sbozzolo force-pushed the gb/automatic_restart branch from 0a7ccd7 to 8533bfa Compare September 24, 2024 15:14
test/restart.jl Outdated Show resolved Hide resolved
test/restart.jl Outdated Show resolved Hide resolved
@Sbozzolo Sbozzolo force-pushed the gb/automatic_restart branch 3 times, most recently from 563b580 to c5211f2 Compare September 25, 2024 15:00
@Sbozzolo
Copy link
Member Author

This is how the output looks like in case of failure:

config = sphere equil 0M NoWarp
solve!:: 0.554812 seconds (2.24 k allocations: 1.059 MiB, 1.74% compilation time)
    just reading data
integrator.p.precomputed.ᶜlinear_buoygrad error: 0.00020609275
integrator.p.precomputed.ᶜgradᵥ_θ_virt.components.data.1 error: 0.00023572937
integrator.p.precomputed.ᶜgradᵥ_θ_liq_ice.components.data.1 error: 0.00023572937
    reading and simulating
solve!:: 0.000005 seconds (1 allocation: 400 bytes)
integrator.p.precomputed.ᶜlinear_buoygrad error: 0.00018343895
integrator.p.precomputed.ᶜgradᵥ_θ_virt.components.data.1 error: 0.00018323408
integrator.p.precomputed.ᶜgradᵥ_θ_liq_ice.components.data.1 error: 0.00018323408
config = box equil 0M NoWarp
solve!:: 0.135827 seconds (2.60 k allocations: 493.500 KiB, 10.25% compilation time)
    just reading data
integrator.p.precomputed.ᶜlinear_buoygrad error: 0.00043861143
integrator.p.precomputed.ᶜgradᵥ_θ_virt.components.data.1 error: 0.0004185408
integrator.p.precomputed.ᶜgradᵥ_θ_liq_ice.components.data.1 error: 0.0004185408
    reading and simulating
solve!:: 0.000002 seconds (1 allocation: 400 bytes)
integrator.p.precomputed.ᶜlinear_buoygrad error: 0.00035853966
integrator.p.precomputed.ᶜgradᵥ_θ_virt.components.data.1 error: 0.0003376229
integrator.p.precomputed.ᶜgradᵥ_θ_liq_ice.components.data.1 error: 0.0003376229
config = column equil 0M NoWarp
┌ Warning: perturb_initstate flag is ignored for single column configuration
└ @ ClimaAtmos /central/scratch/esm/slurm-buildkite/climaatmos-ci/20797/climaatmos-ci/src/solver/type_getters.jl:193
┌ Warning: Bubble correction not compatible with single column configuration. It will be switched off.
└ @ ClimaAtmos /central/scratch/esm/slurm-buildkite/climaatmos-ci/20797/climaatmos-ci/src/solver/type_getters.jl:204
solve!:: 0.064094 seconds (2.45 k allocations: 343.078 KiB, 28.11% compilation time)
    just reading data
integrator.p.precomputed.ᶜlinear_buoygrad error: 0.0002045927
integrator.p.precomputed.ᶜgradᵥ_θ_virt.components.data.1 error: 0.0002344887
integrator.p.precomputed.ᶜgradᵥ_θ_liq_ice.components.data.1 error: 0.0002344887
    reading and simulating
solve!:: 0.000003 seconds (1 allocation: 400 bytes)
integrator.p.precomputed.ᶜlinear_buoygrad error: 9.4087634e-5
integrator.p.precomputed.ᶜgradᵥ_θ_virt.components.data.1 error: 0.0001073189
integrator.p.precomputed.ᶜgradᵥ_θ_liq_ice.components.data.1 error: 0.0001073189

@Sbozzolo Sbozzolo force-pushed the gb/automatic_restart branch 4 times, most recently from ffd0b54 to 1ad8f95 Compare September 25, 2024 16:44
@Sbozzolo Sbozzolo force-pushed the gb/automatic_restart branch from 1ad8f95 to ba797c2 Compare September 26, 2024 17:15
@Sbozzolo Sbozzolo enabled auto-merge September 26, 2024 23:00
…stics

`face_sw_direct_flux_dn` is used only in
`AllSkyRadiationWithClearSkyDiagnostics` in this block:
```
NVTX.@Annotate function update_sw_fluxes!(
    ::AllSkyRadiationWithClearSkyDiagnostics,
    model,
)
    RRTMGP.RTESolver.solve_sw!(
        model.sw_solver,
        model.as,
        model.lookups.lookup_sw,
        nothing,
        model.lookups.lookup_sw_aero,
    )
    parent(model.face_clear_sw_flux_up) .= parent(model.face_sw_flux_up)
    parent(model.face_clear_sw_flux_dn) .= parent(model.face_sw_flux_dn)
    parent(model.face_clear_sw_direct_flux_dn) .=
        parent(model.face_sw_direct_flux_dn)
    parent(model.face_clear_sw_flux) .= parent(model.face_sw_flux)
    RRTMGP.RTESolver.solve_sw!(
        model.sw_solver,
        model.as,
        model.lookups.lookup_sw,
        model.lookups.lookup_sw_cld,
        model.lookups.lookup_sw_aero,
    )
end
```

It is not used by other modes, so I don't add it
This commit adds a new option, `detect_restart_file`. When
`detect_restart_file` is true, ClimaAtmos looks at the previous
`output_XXXX` output folder and tries to see if there's a restart file
in it. If yes, use it. This behavior is overridden when a restart file
is explicitely passed.

This commit uses and re-implments the same logic implemented in
ClimaUtilities.OutputPathGenerator. We might want to move that logic to
ClimaUtilities at some point.
This commit adds the hash of the AtmosModel instance used to prepare a
simulation in the checkpoint HDF5 files. This is then used to check that
the AtmosModel is the "same" (up to a hash) across restarts.

A test is added to check that this mechanism works. In the test, I
produced a checkpoint and read it with the same configuration with one
value changed (insolation).
This commit adds a test to ensure that restarted simulations are
identical to non-restarted ones.

This is accomplished by running a simulation for three steps and
comparing it to two other cases:
1. A simulation restarted from the checkpoint produced at step 3
2. A simulation restarted from the checkpoint produced at step 2 and
   solved
@Sbozzolo Sbozzolo force-pushed the gb/automatic_restart branch from 6280df5 to 452f45a Compare September 27, 2024 02:31
@Sbozzolo Sbozzolo added this pull request to the merge queue Sep 27, 2024
Merged via the queue into main with commit e731500 Sep 27, 2024
12 of 16 checks passed
@Sbozzolo Sbozzolo deleted the gb/automatic_restart branch September 27, 2024 06:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants