Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 3 timestep restart unit test #3031

Closed
wants to merge 1 commit into from
Closed

Add 3 timestep restart unit test #3031

wants to merge 1 commit into from

Conversation

akshaysridhar
Copy link
Member

@akshaysridhar akshaysridhar commented May 20, 2024

Tests for restart and custom output directory generation. Assumes output_active path will be generated using ClimaUtilities.OutputPathGenerator. Addresses #2604 (not sufficient to close this issue).

test/runtests.jl Outdated Show resolved Hide resolved
@szy21
Copy link
Member

szy21 commented May 20, 2024

Could we add radiation callback and EDMF too? Maybe we can have two cases, a simple one and a more complex one?

@Sbozzolo
Copy link
Member

I think "Make sure restarts work correctly" is a little more nuanced than this.

  • How are the new callback initialized? For example, if you have a callback called every N/dt, does it pick up from the old ones has left or is a new counter?
  • We have to test MPI/GPU cases as well
  • What happens when you change settings across restarts?
  • Accmulated diagnostics will not work across restarts
  • Progress log will probably be off across restarts
  • Where are restart/output files saved? If we use the ActiveLink style and keep restart_dir = .../output_active, we will need to be careful because the base dir associated to output_active changes with the new simulation.

These are some of the first things that come to mind but I am sure that there are much more. Restarts are not easy to handle properly!

@szy21
Copy link
Member

szy21 commented May 20, 2024

I think "Make sure restarts work correctly" is a little more nuanced than this.

  • How are the new callback initialized? For example, if you have a callback called every N/dt, does it pick up from the old ones has left or is a new counter?
  • We have to test MPI/GPU cases as well
  • What happens when you change settings across restarts?
  • Accmulated diagnostics will not work across restarts
  • Progress log will probably be off across restarts
  • Where are restart/output files saved? If we use the ActiveLink style and keep restart_dir = .../output_active, we will need to be careful because the base dir associated to output_active changes with the new simulation.

These are some of the first things that come to mind but I am sure that there are much more. Restarts are not easy to handle properly!

Right, I don't think this PR should close the issue. I moved your comments to the issue so we can discuss how to address them.

@akshaysridhar
Copy link
Member Author

I'm testing all reasonable combinations of this simulation offline, there is a tradeoff w.r.t testing time (assuming that the user sets identical configurations for now). I've updated the description to remove closing keyword cc @szy21 (my understanding of this was that the primary focus here is to ensure that simulation objects have consistent properties and spaces) - I'll add some smaller issues to capture the details which are more closely related to callbacks and progress logging but maintain this PR as a check of "state" accuracy across MPI/GPU combinations. There are other interface related issues when automating this due to default config args and "unused" config options for different combinations of restarted simulations - I think we generally need to update our error handling and hidden default settings w.r.t simulation configurations cc @Sbozzolo .

@szy21
Copy link
Member

szy21 commented May 20, 2024

I think this is a good start. Let's address MPI and GPU in this PR if that's easy. We can address other issues in separate PRs. For callbacks and diagnostics, I think for now we can force the simulation time to be multiple of the N or dt of callback and accumulated diagnostics (or throw a warning if it's not). I don't think we need to worry too much about what happens when changing settings across restarts, as it's very unlikely that one will change the settings during one simulation, so there is no truth to compare with.

@akshaysridhar @Sbozzolo What do you think? We can move the discussion to the issue.

test/restart.jl Outdated Show resolved Hide resolved
@Sbozzolo
Copy link
Member

Yes, this is a perfectly good test to have, but should not close the issue.

@akshaysridhar akshaysridhar force-pushed the as/restart branch 4 times, most recently from 713bfe7 to dc847af Compare May 24, 2024 17:02
@akshaysridhar
Copy link
Member Author

@Sbozzolo Output dirs are now created as temp directories; macos tests time-out, ubuntu tests pass; I can reduce test scope if necessary - (There will be more added tests as we define the restart functionality more precisely in PRs to follow)

test/restart.jl Outdated Show resolved Hide resolved
test/restart.jl Outdated Show resolved Hide resolved
test/restart.jl Outdated
simulation = CA.get_simulation(config)
CA.solve_atmos!(simulation)

restart_dir = simulation.output_dir
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for current main, but it will probably not be if we restore OutputPathGenerator

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can wait till OutputPathGenerator is reintroduced if necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably want restart_dir = dirname(parsed_args["restart_file"]) here. This is nice in that it is declarative. I would hope that this is compatible with OutputPathGenerator, or that we can somehow make it compatible with OutputPathGenerator.

(FT(0), FT(0))
@test extrema(parent(Y₁.c.ρq_tot) .- parent(Y.c.ρq_tot)) ==
(FT(0), FT(0))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check some variables in p too.

test/restart.jl Outdated Show resolved Hide resolved
test/restart.jl Outdated
# settings (or when tendency computations conflict with
# dt / t_end parsed args)

for configuration in ["column"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do spheres too, with topography as well?

Copy link
Member

@Sbozzolo Sbozzolo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start, but I think we should keep in mind that this is still a relatively weak test with respect to fully supporting restarts. We should go ahead with this, but we should also think about restarts more in general and what we want to support

@akshaysridhar
Copy link
Member Author

akshaysridhar commented May 28, 2024

This is a good start, but I think we should keep in mind that this is still a relatively weak test with respect to fully supporting restarts. We should go ahead with this, but we should also think about restarts more in general and what we want to support

We may need to look at the HDF5Reader / Writer features: on spheres, we get exact matches for scalar tracers and not for the vectors (e.g O(1e-5) error in Y.f.u\_3 upon restarting. (The ClimaCore tests only look at local_geometry properties)

@Sbozzolo
Copy link
Member

Is the error relative or absolute?

Can you coming up a simple example where you set up a space, set up a vector on the space, save it to HDF5 and read it back and show that they are different?

@akshaysridhar
Copy link
Member Author

Is the error relative or absolute?

Can you coming up a simple example where you set up a space, set up a vector on the space, save it to HDF5 and read it back and show that they are different?

Absolute error : In the ubuntu ci build we have

[ Info: Check file-read from checkpoint data
maximum(abs.(parent(f1Y.c.uₕ) .- parent(f2Y.c.uₕ))) = 2.9802322f-8
maximum(abs.(parent(f1Y.f.u₃) .- parent(f2Y.f.u₃))) = 3.0517578f-5

@akshaysridhar
Copy link
Member Author

akshaysridhar commented Jun 6, 2024

@Sbozzolo The test has been updated now that OutputPathGenerator is merged; please see if the mktempdir is a sufficient solution for this PR - I think we can update test/restart.jl and test/callbacks.jl to verify behaviour of any future improvements to the restart functionality (callbacks / progress / accuracy). (The core feature is still the HDF5Reader within ClimaCore.jl)

test/restart.jl Outdated Show resolved Hide resolved
test/restart.jl Outdated Show resolved Hide resolved
in various configurations

Co-authored-by: Gabriele Bozzola <[email protected]>
@szy21
Copy link
Member

szy21 commented Jul 3, 2024

What's the status of this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants