Not reproducible across restarts #281

Open
aekiss opened this issue Mar 6, 2024 · 10 comments
@aekiss
Contributor

aekiss commented Mar 6, 2024

copying a Slack DM discussion here

@aidanheerdegen and Utkarsh discovered that ACCESS-OM2 is not reproducible across restarts, i.e. two 1-day runs give a different result from one 2-day run. The non-reproducibility was detected via this test in this PR.

I've done some test runs in ~aek156/payu/om2-restart-repro and confirmed that this problem occurs even when comparing 2x2-timestep vs 1x4-timestep runs (the shortest possible, as the model can't run for one timestep). For example, see all the md5 differences in

diff /g/data/v45/aek156/outputs/om2-restart-repro/1deg_jra55_iaf_2step/output002/manifests/restart.yaml /g/data/v45/aek156/outputs/om2-restart-repro/1deg_jra55_iaf_4step/output001/manifests/restart.yaml
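The same kind of check can be made directly on the restart files rather than on the manifests. The following is only a minimal sketch: the function names `md5sum` and `differing_files` are hypothetical, and it assumes the two runs write restart files with matching names into two directories.

```python
import hashlib
from pathlib import Path


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the md5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_files(dir_a: Path, dir_b: Path) -> list[str]:
    """Names of files present in both directories whose md5 digests differ.

    Hypothetical helper: assumes restart files have the same names in both runs.
    """
    names_a = {p.name for p in dir_a.iterdir() if p.is_file()}
    names_b = {p.name for p in dir_b.iterdir() if p.is_file()}
    return sorted(
        name
        for name in names_a & names_b
        if md5sum(dir_a / name) != md5sum(dir_b / name)
    )
```

An empty list from `differing_files` for the final restarts of the 2x2-timestep and 1x4-timestep runs would indicate bitwise reproducibility across the restart; any names returned are the files to investigate.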

Aidan found some old TWG notes suggesting we used to have reproducibility across restarts: https://cosima.org.au/index.php/2018/12/13/technical-working-group-meeting-december-2018/
Nic's COSIMA repro tests use MOM built with a --repro flag (https://github.com/COSIMA/access-om2/blob/master/test/exp_test_helper.py#L217-L218),
but it was apparently never turned on for production builds: https://github.com/COSIMA/access-om2/blame/master/install.sh#L49

aekiss added the bug label Mar 6, 2024
@aekiss
Contributor Author

aekiss commented Mar 6, 2024

This indicates to me that the model state is not fully captured/restored by the restart files.

This is a separate issue from #266, which is the occasional non-determinism of runs from the same restart.

@aidanheerdegen
Contributor

aidanheerdegen commented Mar 6, 2024

It should be stressed that this doesn't mean the model as it stands isn't reproducible, simply that stopping and restarting the model at different points in time will not give consistent results. An experiment can be reproduced as long as the same run lengths were used at all points in an experiment.

We're in the process of assessing the performance impact of adding the repro flags to production builds.

@aidanheerdegen
Contributor

aidanheerdegen commented Mar 6, 2024

> This indicates to me that the model state is not fully captured/restored by the restart files.

Yes.

Note that only the MOM5 model is changed by the use of the --repro option.

The relevant extra compiler options used when --repro is set are here:

https://github.com/ACCESS-NRI/MOM5/blob/master/bin/mkmf.template.nci#L44

My understanding is that the fp-model options constrain the compiler to not perform value-unsafe operations, and also reduce intermediate floating-point precision. If the model internally carries higher floating-point precision at run time than it can represent in restart files, then a restart break is a point in time where the model fields are truncated in precision compared to a model run that does not stop at the same point.

https://www.nccs.nasa.gov/images/FloatingPoint_consistency.pdf

https://community.intel.com/t5/Intel-Fortran-Compiler/Question-about-fp-source-vs-fp-precise-vs-fp-consistent/m-p/1165782#M144277
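The truncation effect described above can be illustrated with a toy calculation. This is a sketch only, not MOM5 code: the update rule in `step` is arbitrary, and the lossy restart is simulated (as an assumption for illustration) by rounding a float64 state to float32, standing in for any case where the in-memory state has more precision than the restart file.

```python
import struct


def step(x: float) -> float:
    """One toy 'timestep': an arbitrary update carried out in float64."""
    return x * 1.000001 + 0.1


def to_lossy_restart(x: float) -> float:
    """Hypothetical lossy restart: round-trip the state through float32."""
    return struct.unpack("f", struct.pack("f", x))[0]


def run(x: float, nsteps: int) -> float:
    """Advance the toy state by nsteps timesteps."""
    for _ in range(nsteps):
        x = step(x)
    return x


x0 = 1.0 / 3.0

# One 4-step run with no restart.
continuous = run(x0, 4)

# Two 2-step runs where the restart preserves the full float64 state.
lossless = run(run(x0, 2), 2)

# Two 2-step runs where the restart truncates the state to float32.
truncated = run(to_lossy_restart(run(x0, 2)), 2)

print(continuous == lossless)   # restart is transparent
print(continuous == truncated)  # restart changes the answer
```

In this toy the run with a full-precision restart matches the continuous run exactly, while the run with a truncated restart does not, which is the behaviour the hypothesis above would predict for a restart that loses precision.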

@aekiss
Contributor Author

aekiss commented Mar 8, 2024

I hope this old issue is irrelevant #23

@ofa001

ofa001 commented Mar 8, 2024

Hi @aekiss, it's so long ago that I can't remember if it was acted on; it was certainly passed on. I am wondering if it's something to do with forward and leapfrog time steps in the ocean, and whether that was fully captured in the i2o.nc type files. It's so long since I have looked at any of this. Both MOM and CICE run on their own should be OK.
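To illustrate how a forward/leapfrog distinction could break reproducibility across restarts, here is a toy sketch. It is not the MOM5 scheme and says nothing about what i2o.nc actually contains: the tendency `f`, the timestep `DT` and the function `run_leapfrog` are all hypothetical. A leapfrog step needs two time levels, so a restart that saves only the current level forces the next segment to begin with a forward step.

```python
DT = 0.1  # toy timestep (hypothetical value)


def f(x: float) -> float:
    """Toy tendency dx/dt = -x."""
    return -x


def run_leapfrog(prev, curr, nsteps):
    """Advance nsteps with leapfrog; returns (previous level, current level).

    If prev is None (no earlier time level available), the first step is a
    forward (Euler) step, as at a cold start or an incomplete restart.
    """
    if prev is None and nsteps > 0:
        prev, curr = curr, curr + DT * f(curr)
        nsteps -= 1
    for _ in range(nsteps):
        prev, curr = curr, prev + 2.0 * DT * f(curr)
    return prev, curr


# One 4-step run.
_, continuous = run_leapfrog(None, 1.0, 4)

# Two 2-step runs, restart saves both time levels.
p, c = run_leapfrog(None, 1.0, 2)
_, full_restart = run_leapfrog(p, c, 2)

# Two 2-step runs, restart saves only the current time level.
_, c = run_leapfrog(None, 1.0, 2)
_, partial_restart = run_leapfrog(None, c, 2)

print(continuous == full_restart)     # same sequence of operations
print(continuous == partial_restart)  # extra forward step changes the result
```

With both time levels saved, the split run performs exactly the same operations as the continuous run. With only one level saved, the second segment restarts with a forward step and the results diverge, even though each run on its own is deterministic.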

@aidanheerdegen
Contributor

As far as I can tell it is reproducible across restarts with the --repro build option for OM2. That is what the Jenkins tests did, and they were reporting success for that test.

@aekiss
Contributor Author

aekiss commented Mar 20, 2024

The ACCESS-NRI release of ACCESS-OM2 will include a variant with reproducibility across restarts - see ACCESS-NRI/ACCESS-OM2#53

@aekiss
Contributor Author

aekiss commented Mar 24, 2024

However, the restart-reproducible variant will be unable to reproduce historical runs - see https://forum.access-hive.org.au/t/access-om2-bit-repro-testing/1960

@aidanheerdegen
Contributor
