floating invalid and atm and ice out of sync in IAF run #22

Closed
aekiss opened this issue Jul 24, 2019 · 39 comments

@aekiss
Contributor

aekiss commented Jul 24, 2019

Maurice and Ryan have been getting an "atm and ice models out of sync" error in 0.25 deg IAF runs using COSIMA/025deg_jra55_iaf@383b27b

This config uses the latest libaccessom2 (b6caeab) but I don't know of any IAF runs that used anything newer than e8ad372 (from Aug 31, 2018) and there are many differences between them: e8ad372...b6caeab

@aekiss
Contributor Author

aekiss commented Jul 24, 2019

Not sure if it helps, but
/short/public/rmh561/access-om2/025deg_jra55_iaf/archive/restart120/accessom2_restart.nml
contains

FORCING_CUR_DATE        = 1959-12-31T00:00:00,
EXP_CUR_DATE    = 1960-01-01T00:00:00

so they're 1 day out of sync

@mauricehuguenin

Here are the changes to the config files in 383b27b we did for the latest run which got us the error:

  • we changed the queue, project and shortpath in config.yaml to express, e14 and /short/e14
  • we set ice_ocean_timestep = 1200 and restart_period = 1, 0, 0 in accessom2.nml. With ice_ocean_timestep = 1800 or 1350, the runs abort on either 09-27 or 12-03.
  • we use the restart120 from /g/data3/hh5/tmp/cosima/access-om2-025/025deg_jra55v13_iaf_gmredi6
  • we set use_restart_time = .false. in cice_in.nml as we are doing a run with an existing restart

@aekiss
Contributor Author

aekiss commented Jul 24, 2019

Restart 120 of 025deg_jra55v13_iaf_gmredi6 is at 2200-01-01T00:00:00. This is 1960 in the 5th JRA55 60-year forcing cycle, but there will be a probably-harmless 1-day or 2-day offset in the forcing in the 5th cycle due to leap year differences: COSIMA/access-om2#149
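The leap-year offset can be illustrated with a rough day-counting sketch. It is not libaccessom2's actual logic: it assumes the forcing covers 1958-01-01 to 2017-12-31 (60 years from 1958; the end year is inferred, not stated here), that the forcing advances one calendar day per experiment day, and that it wraps at the end of each cycle. The names `forcing_date`, `FORCING_START` and `FORCING_END` are made up for the example.

```python
from datetime import date, timedelta

# Assumed forcing period: 60 years starting in 1958 (end date is exclusive).
FORCING_START = date(1958, 1, 1)
FORCING_END = date(2018, 1, 1)
CYCLE_DAYS = (FORCING_END - FORCING_START).days  # 60 years including 15 leap days


def forcing_date(exp_date, exp_start=FORCING_START):
    """Forcing date reached after stepping day by day from exp_start to exp_date,
    wrapping back to FORCING_START at the end of every cycle."""
    elapsed = (exp_date - exp_start).days
    return FORCING_START + timedelta(days=elapsed % CYCLE_DAYS)


# Four full cycles contain 60 leap days, but 1958-2199 in the Gregorian
# calendar contains only 59 (2100 is not a leap year), so the forcing
# ends up one day behind the experiment date.
print(forcing_date(date(2200, 1, 1)))  # 1959-12-31
```

Under these assumptions the result agrees with the FORCING_CUR_DATE = 1959-12-31 that accompanies EXP_CUR_DATE = 2200-01-01 in the restart120 namelist quoted later in this thread.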

The simplest thing to do is to keep the start date as 2200-01-01T00:00:00, which is what was done in 025deg_jra55v13_iaf_gmredi6.

If you copy /g/data3/hh5/tmp/cosima/access-om2-025/025deg_jra55v13_iaf_gmredi6/restart120 to your archive and don't change anything (including the directory name), I think it should run if you have use_restart_time = .true. in ice/cice_in.nml.

Apologies if this misled you
https://github.com/COSIMA/access-om2/wiki/Tutorials#Starting-an-experiment-from-existing-restarts
I'll add a clarification that this is only needed if date changes are required.

@aekiss
Contributor Author

aekiss commented Jul 24, 2019

You may also need to copy output120/ice/cice_in.nml to your archive - see payu-org/payu#193

@rmholmes

Yes I can successfully run and complete a 9 month run by linking directly to the unmodified restart120/ and output120/ from the /g/data3/hh5/tmp/cosima/access-om2-025/025deg_jra55v13_iaf_gmredi6/ (providing I use use_restart_time=.true.).

Now to step through and find the floating invalid cause...

@aekiss
Contributor Author

aekiss commented Jul 24, 2019

OK thanks for confirming. I think this issue can now be closed, but feel free to re-open if you see this problem again.

@aekiss aekiss closed this as completed Jul 24, 2019
@rmholmes

Ok thanks Andrew. It would be nice to be able to shift the dates back for the new cycle - but this seems to mean that we can't?

@aekiss
Contributor Author

aekiss commented Jul 24, 2019

There would certainly be a way, but it would require careful setup along the lines of the 2nd method in
https://github.com/COSIMA/access-om2/wiki/Tutorials#starting-a-new-experiment-using-restarts-from-a-previous-experiment

@mauricehuguenin

Thanks, I also have a run now which worked until 2200-10-27. I am now looking into the floating invalid message I get.

@mauricehuguenin

mauricehuguenin commented Aug 11, 2019

A quick update as we are still getting 'out of sync' errors:

We are trying to run a 1-year simulation with 383b27b using the restart120 folder in /g/data/hh5/tmp/cosima/access-om2-025/025deg_jra55v13_iaf_gmredi6. Apart from changing the short path and project in config.yaml, we made no other changes to the files.

  • when using use_restart_time = .true. in ice/cice_in.nml we can successfully run two consecutive 1-month simulations. However, when trying to complete the simulation year by running another 10 months, the model aborts on 2200/10/27 with the error forrtl: error (65): floating invalid.
    You can find the error file access-om2.1102487.r-man2.err in /short/public/mv7494/

  • when using use_restart_time = .false. in ice/cice_in.nml and simulating 12 months until 2200/12/31, we can get past the date when we get the floating invalid message. However, the model encounters an error when de-initializing:
    Error in accessom2_deinit: atm and ocean are out of sync
    atm end date: 1958-03-01T00:00:00.000
    ocean end date: 2200-12-30T00:00:00.000
    1
    forrtl: error (78): process killed (SIGTERM)
    and the output is not correctly archived. However, I can manually archive the output with payu archive. It may be possible that the sync message arises from a discrepancy between leap days of the ocean and atmosphere parts of the model.

We also looked at the output for the day 2200/10/27 when the above run (first bullet point) gave an error and couldn't find anything out of the ordinary.
You can find the error file for this run access-om2.1063328.r-man2.err also in /short/public/mv7494/.
Of note here is also that the model stops on the 30th of December instead of the 31st. Running another 1-day simulation from the 30th until the end of the year with ice_ocean_timestep = 1220 or 1800 unfortunately does not work as I get the message
assertion failed: accessom2_sync_config: total runtime in seconds not integer multiple of ice_ocean_timestep.
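
The condition behind this assertion can be checked by hand. The following is only a sketch of the arithmetic the message describes, not the libaccessom2 code; `runtime_is_compatible` is a made-up helper, and the runtime is assumed to be exactly one 86400-second day.

```python
def runtime_is_compatible(runtime_seconds, ice_ocean_timestep):
    """True when the total runtime is a whole number of coupling timesteps."""
    return runtime_seconds % ice_ocean_timestep == 0


ONE_DAY = 24 * 60 * 60  # 86400 s, assuming a run of exactly one day

for dt in (1200, 1220, 1350, 1800):
    print(dt, runtime_is_compatible(ONE_DAY, dt))
# 1200 True
# 1220 False
# 1350 True
# 1800 True
```

A plain one-day run is divisible by 1800 but not by 1220, so if the assertion also fires with 1800 it may be worth checking whether the configured run length is really a whole day.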

Our reason for initially changing use_restart_time from .true. to .false. in ice/cice_in.nml was that we would like our simulation to start in 1958 again instead of in 2200.

@aekiss aekiss reopened this Aug 12, 2019
@rmholmes

It is strange that you can get past the floating invalid with one option but not the other...

To address the time syncing problem, I have just tried to get a run going with correct dates by following exactly Andrew's 2nd method in
https://github.com/COSIMA/access-om2/wiki/Tutorials#starting-a-new-experiment-using-restarts-from-a-previous-experiment, using restart120/ in /home/561/rmh561/access-om2/025deg_jra55_iaf/

It fails on initialization no matter the length of run with a
FATAL from PE 332: diag_manager_mod::register_diag_field: file=ocean_snapshot: Invalid_date. Date=1960-02-31 00:00:00.

I think this is because in archive/restart000/accessom2_restart.nml:

 FORCING_CUR_DATE        = 1959-12-31T00:00:00,
 EXP_CUR_DATE    = 1959-12-31T00:00:00

(where I have changed the EXP_CUR_DATE to match FORCING_CUR_DATE). These days aren't the end of the year presumably because of the leap year problems. The diag manager is complaining with monthly output because everything is shifted one day off (and we need monthly output...).

It seems thus that the only option is to "fake it" by shifting all the dates forward by a day and missing a day in the forcing. Does this sound right @nichannah @aekiss? Any help would be appreciated...

@russfiedler
Contributor

Is there an OMIP protocol for this sort of thing? i.e. when you perform 5/6 IAF cycles do you start each cycle at the proper starting date or do you continue the time series even though it gets out of whack?

If the former then we really need a standard way to restart a cycle cleanly.

@russfiedler
Contributor

@rmholmes MOM/FMS will always fail at some point if you start runs after the 28th of the month and have an increment of 'months'. You have to start your runs on 01-MM-YYYY 00:00:00.
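
To see why a start on the 31st goes wrong with monthly increments, here is a toy illustration. It is not the FMS time manager; `naive_add_months` is a made-up helper that simply keeps the day-of-month fixed while stepping the month, which is an assumption about the kind of arithmetic that produces a date like 1960-02-31.

```python
from datetime import date


def naive_add_months(d, months):
    """Advance by whole months while keeping the day-of-month unchanged."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, d.day)


# Starting on the 1st is always safe.
print(naive_add_months(date(1960, 1, 1), 1))  # 1960-02-01

# Starting on 1959-12-31 and stepping two months asks for 1960-02-31.
try:
    naive_add_months(date(1959, 12, 31), 2)
except ValueError as err:
    print("invalid date:", err)
```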

@rmholmes

rmholmes commented Aug 12, 2019

Yes I figured. I tried intervals of years, months or seconds. It failed in every case. Our restart is dated 12-31 so the only option seems to be to fake it somehow.

@russfiedler
Contributor

I think the best solution would be to modify the global attributes in the ice restart file to reflect the date that you wish to start from. I would rename the original attributes and store them in the modified file so that everything is tracked correctly.

@rmholmes

So if I "fake it", by changing in accessom2_restart.nml:

 FORCING_CUR_DATE        = 1960-01-01T00:00:00,
 EXP_CUR_DATE    = 1960-01-01T00:00:00

(similarly for the date in the ocean file ocean_solo.res), then I can run a month but it fails in the deinit stage at the end of the run with:

Error in accessom2_deinit: atm and ice models are out of sync.
atm end date: 1960-02-01T00:00:00.000
ice end date: 1958-02-01T00:00:00.000

Why the two years? I only shifted the dates by 1 day and I'm using use_restart_time = .false. in cice_in.nml. The ice end date is clearly wrong.
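
As a quick way to quantify the mismatch, the two end dates can be pulled out of the error text and subtracted. This is a small sketch, assuming the message format shown above; `end_dates` is a made-up helper.

```python
import re
from datetime import datetime

LOG = """Error in accessom2_deinit: atm and ice models are out of sync.
atm end date: 1960-02-01T00:00:00.000
ice end date: 1958-02-01T00:00:00.000
"""


def end_dates(log_text):
    """Return {model: datetime} for every '<model> end date: <stamp>' line."""
    pattern = re.compile(r"^(\w+) end date: (\S+)$", re.MULTILINE)
    return {
        model: datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")
        for model, stamp in pattern.findall(log_text)
    }


dates = end_dates(LOG)
gap = dates["atm"] - dates["ice"]
print(gap.days)  # 730
```

The gap is exactly two 365-day years, which is what one would see if the ice model had started from 1958-01-01 rather than from 1960-01-01.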

Thanks @russfiedler . So do I need to change the time and perhaps nyr in the global attributes of iced.2200-01-01-000.nc in the ice restarts? I thought the option use_restart_time=.false. meant it ignored these times?

@aidanheerdegen
Contributor

Here is the code I mentioned in the MOM meeting that I wrote for Rishav, so he could edit his CICE restart files and change the dates

https://gist.github.com/aidanheerdegen/203af6f6e0a87d1d82704eae9608f099

There is some description in the comments on how to use it.

Is that useful?

@rmholmes

rmholmes commented Aug 12, 2019

Thanks @aidanheerdegen, that looks like what I need, although it fails with:

[rmh561@raijin3 ice]$ pwd
/short/e14/rmh561/access-om2/archive/025deg_jra55_iaf/restart000/ice
[rmh561@raijin3 ice]$ ./edit_time 
Filename? iced.2200-01-01-00000.nc
forrtl: severe (39): error during read, unit 0, file /short/e14/rmh561/access-om2/archive/025deg_jra55_iaf/restart000/ice/iced.2200-01-01-00000.nc
Image              PC                Routine            Line        Source             
edit_time          000000000047D4BA  Unknown               Unknown  Unknown
edit_time          000000000047BFB6  Unknown               Unknown  Unknown
edit_time          000000000043AD30  Unknown               Unknown  Unknown
edit_time          000000000040690E  Unknown               Unknown  Unknown
edit_time          0000000000405E4F  Unknown               Unknown  Unknown
edit_time          000000000041F919  Unknown               Unknown  Unknown
edit_time          0000000000402D9A  Unknown               Unknown  Unknown
edit_time          0000000000402BDC  Unknown               Unknown  Unknown
libc.so.6          00007F624951FD20  Unknown               Unknown  Unknown
edit_time          0000000000402AD9  Unknown               Unknown  Unknown
[rmh561@raijin3 ice]$

I've overwritten the istep1, time and nyr attributes using matlab for the restarts, back to the 1960-01-01 values that are in /g/data/hh5/tmp/cosima/access-om2-025/025deg_jra55v13_iaf_gmredi6/restart000/ice/iced.1960-01-01-00000.nc. Hopefully that does the trick.

@aidanheerdegen
Contributor

aidanheerdegen commented Aug 12, 2019

Oh right, these are netCDF files so ignore me. The thing I wrote was for the binary CICE restart files. So, as you have done, edit the global attributes:

                :month = 1 ;
                :mday = 1 ;
                :sec = 0 ;
                :istep1 = 35040. ;
                :time = 63072000. ;
                :time_forc = 0. ;
                :nyr = 3. ;

and you're laughing
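
For reference, the attribute values above can be reproduced with a little arithmetic. This is a sketch under assumptions that are inferred from the numbers rather than stated in the thread: time is seconds since 1958-01-01, istep1 is that time divided by an 1800 s timestep, and nyr counts years from 1958 starting at 1. The helper `cice_time_attrs` is made up, and it counts days in the Gregorian calendar, which may not match CICE's own calendar for spans that include a leap day.

```python
from datetime import datetime


def cice_time_attrs(start, year_init=1958, dt=1800):
    """Global-attribute values for a restart meant to begin at `start`.

    year_init and dt are assumptions inferred from the values quoted above.
    """
    origin = datetime(year_init, 1, 1)
    seconds = (start - origin).total_seconds()
    return {
        "month": start.month,
        "mday": start.day,
        "sec": start.hour * 3600 + start.minute * 60 + start.second,
        "istep1": seconds / dt,
        "time": seconds,
        "time_forc": 0.0,
        "nyr": float(start.year - year_init + 1),
    }


attrs = cice_time_attrs(datetime(1960, 1, 1))
print(attrs["istep1"], attrs["time"], attrs["nyr"])  # 35040.0 63072000.0 3.0
```

The values would then be written into the restart file with whatever netCDF tool is at hand (MATLAB was used above), ideally keeping the original attributes under new names as suggested earlier.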

@rmholmes

Ok I can get around the syncing problems and use proper dates (starting from 1960) if I change the ice restart netcdf file attributes as above and use use_restart_time=.true. in ice/cice_in.nml. If use_restart_time=.false. then the ice model seems to get the wrong date (defaults to the first date of the forcing?) and so the syncing error is thrown. @aekiss I think you need to change your tutorial instructions step 10 and 11?

@mauricehuguenin I would suggest going through as I have and trying to run for a year. The syncing problems should be solved, but the blow-up is probably still there.

@aekiss
Contributor Author

aekiss commented Aug 12, 2019

Hi @rmholmes, @aidanheerdegen glad you found a recipe that works, apologies if the tutorial was less than helpful. I've added a quick note to the tutorial linking to this discussion, but if you have a clearer way to present a reliable method let me know.

Tutorial step 10 (use_restart_time = .false.) should mean that istep1, time, time_forc, etc in the restart file are ignored, and namelist values are used instead (see here). So yes, if you edit the restart you'd need to skip that step.

@mauricehuguenin

Thanks all for helping! I am now going through the year with 1-month simulations by using Ryan's modified restart files.

@rmholmes

Hi @aekiss the problem is that with use_restart_time=.false. it seems to get the wrong time. I presume by "namelist values" you mean it should be taking them from accessom2_restart.nml? It doesn't seem to be doing that as it picks up 1958 instead of 1960.

I had a quick look through the cice5 code but couldn't figure it out.

@aekiss
Contributor Author

aekiss commented Aug 13, 2019

Hmm, interesting. accessom2_restart.nml is read by libaccessom2, not cice:

! Read in exp_cur_date and focing_cur_date from restart file.

I presume the start time is supposed to be passed on to cice, to be used whenever use_restart_time=.false., but for some reason this isn't working.

@nichannah can you shed any light on this?

@aekiss aekiss changed the title from "atm and ice models out of sync in IAF run" to "floating invalid and atm and ice out of sync in IAF run" Aug 14, 2019
@aekiss
Contributor Author

aekiss commented Aug 14, 2019

grep -v set_date_c /short/public/mv7494/access-om2.1102487.r-man2.err | uniq | less shows the error is

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
fms_ACCESS-OM_50d  0000000001B03591  Unknown               Unknown  Unknown
fms_ACCESS-OM_50d  0000000001B016CB  Unknown               Unknown  Unknown
fms_ACCESS-OM_50d  0000000001AAD804  Unknown               Unknown  Unknown
fms_ACCESS-OM_50d  0000000001AAD616  Unknown               Unknown  Unknown
fms_ACCESS-OM_50d  0000000001A2E539  Unknown               Unknown  Unknown
fms_ACCESS-OM_50d  0000000001A38EDC  Unknown               Unknown  Unknown
libpthread-2.12.s  00002AC6B0AB47E0  Unknown               Unknown  Unknown
fms_ACCESS-OM_50d  00000000004CFFC3  ocean_tempsalt_mo         769  ocean_tempsalt.F90
fms_ACCESS-OM_50d  00000000004C7EF1  ocean_tempsalt_mo         978  ocean_tempsalt.F90
fms_ACCESS-OM_50d  0000000000A2D2FF  ocean_tracer_mod_        2496  ocean_tracer.F90
fms_ACCESS-OM_50d  000000000042BC93  ocean_model_mod_m        1847  ocean_model.F90
fms_ACCESS-OM_50d  00000000004177B6  MAIN__                    464  ocean_solo.F90
fms_ACCESS-OM_50d  000000000040DDDE  Unknown               Unknown  Unknown
libc-2.12.so       00002AC6B0CE0D20  __libc_start_main     Unknown  Unknown
fms_ACCESS-OM_50d  000000000040DCE9  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)

This looks like the same error as in your email on 5 July (it's the same line of the same source file).

I don't know what version of MOM you are using (what is the first hash in the ocean (fms) exe name in config.yaml?) but if it's f8967b1 that line is
https://github.com/mom-ocean/MOM5/blob/f8967b1/src/mom5/ocean_tracers/ocean_tempsalt.F90#L769

@russfiedler emailed some suggestions back in July as to what might be going wrong there (assuming your version has a similar ocean_tempsalt.F90).

@nichannah
Contributor

Apologies for not joining this conversation sooner, I've updated my notification settings.

Firstly I tried to reproduce this problem using the information given above and @aekiss's instructions under "Simple case: no date change" (see https://github.com/COSIMA/access-om2/wiki/Tutorials#starting-a-new-experiment-using-restarts-from-a-previous-experiment). I came across two instances of what appears to be COSIMA/access-om2#149. As @aekiss mentioned, the dates in the restart (/g/data/hh5/tmp/cosima/access-om2-025/025deg_jra55v13_iaf_gmredi6/restart120/accessom2_restart.nml) do not look right:

 &DO_NOT_EDIT_NML
 FORCING_CUR_DATE        = 1959-12-31T00:00:00,
 EXP_CUR_DATE    = 2200-01-01T00:00:00
 /

So after copying the restart120 and output120 dirs into my new archive I modified this to be:

 &DO_NOT_EDIT_NML
 FORCING_CUR_DATE        = 1960-01-01T00:00:00,
 EXP_CUR_DATE    = 2200-01-01T00:00:00
 /

I moved the forcing date forward rather than changing the experiment date because it looked like the other models also thought the date should be 2200-01-01T00:00:00. e.g. in /g/data/hh5/tmp/cosima/access-om2-025/025deg_jra55v13_iaf_gmredi6/restart120/ocean/ocean_solo.res

     3        (Calendar: no_calendar=0, thirty_day_months=1, julian=2, gregorian=3, noleap=4)
  1958     1     1     0     0     0        Model start time:   year, month, day, hour, minute, second
  2200     1     1     0     0     0        Current model time: year, month, day, hour, minute, second

Then, during runtime, I hit the same bug again (COSIMA/access-om2#149) because 1960 is a leap year but 2200 is not.

I then tried a run using the latest yatm (I have copied it to /short/public/access-om2/bin/yatm_7cfdd5dc.exe) which has this bug fixed and this appears to be working. Note that the fix is only about 2 weeks old so is younger than this issue.

More comments to come.

@nichannah
Contributor

nichannah commented Aug 14, 2019

I also tried using the "More complicated cases" instructions to restart at 1960-01-01. These instructions look OK but I ran into the same problem that @rmholmes mentioned yesterday, with CICE starting in 1958 when it should be in 1960.

It looks like CICE is pulling out the forcing date from libaccessom2 rather than the experiment/model date. It is possible that use_restart_time = .false. has only been tested when these are the same in which case this would be another bug.

I have created an issue for this COSIMA/cice5#38.

I think the problem has been fixed and a new CICE executable can be found at:

/short/public/access-om2/bin/cice_auscom_1440x1080_480p_ab473434_libaccessom2_7cfdd5dc.exe

Using this new executable Andrew's instructions for modifying the restart date appear to be working.

@mauricehuguenin

> I don't know what version of MOM you are using (what is the first hash in the ocean (fms) exe name in config.yaml?) but if it's f8967b1 that line is
> https://github.com/mom-ocean/MOM5/blob/f8967b1/src/mom5/ocean_tracers/ocean_tempsalt.F90#L769

The executables I used thus far are from June:

- yatm_b6caeab.exe
- fms_ACCESS-OM_50dc61e_libaccessom2_b6caeab.x
- cice_auscom_1440x1080_480p_47650cc_libaccessom2_b6caeab.exe

I now switched to the latest versions from August in /short/public/access-om2/bin/, these being:

- yatm_7cfdd5dc.exe
- fms_ACCESS-OM_da2a93f_libaccessom2_b6caeab.x
- cice_auscom_1440x1080_480p_ab473434_libaccessom2_7cfdd5dc.exe

@mauricehuguenin

With the latest executables, namely:

- yatm_7cfdd5dc.exe 
- fms_ACCESS-OM_da2a93f_libaccessom2_b6caeab.x 
- cice_auscom_1440x1080_480p_ab473434_libaccessom2_7cfdd5dc.exe 

I can successfully run a full year starting with WOA initial conditions. I have a small issue with collating the archive files but I think this is a minor issue from my side.

Running a simulation with the same executables and Ryan's modified restart from #22 (comment) I unfortunately still get the out of sync error despite the new CICE version:

Error in accessom2_deinit: atm and ice models are out of sync.
atm end date: 1960-02-01T00:00:00.000
ice end date: 1962-01-31T00:00:00.000
forrtl: error (78): process killed (SIGTERM)

In both cases, I am running with use_restart_time=.true. in ice/cice_in.nml.

I copied the error logs over to /short/public/mv7494/:

access-om2.1263706.r-man2.out
access-om2.1263706.r-man2.err
025_jra55_iaf.o1263706
025_jra55_iaf.e1263706

Would you recommend me going back to the older executables from #22 (comment)?

@aekiss
Contributor Author

aekiss commented Aug 21, 2019

@mauricehuguenin, @rmholmes - Now that COSIMA/access-om2#159 has been resolved, can this issue be closed, or are there some remaining problems?

@rmholmes

The floating invalid error has been fixed. I think @mauricehuguenin still had some syncing problems with Nic's new code when use_restart_time=.false. in cice_in.nml? But we can get around them by explicitly setting the time in the ice restart .nc.

@aekiss
Contributor Author

aekiss commented Aug 21, 2019

We would like to be able to do a restart by simply setting use_restart_time=.false. rather than hand-editing the restarts, so if there are issues with that it would be good to know the details.

@nichannah
Contributor

Please feel free to re-open if there are still problems with this.

@aidanheerdegen
Contributor

With this config:

https://github.com/COSIMA/1deg_jra55_iaf.git

If I run for 1 year, and then restart but run for 3 months I get this error:

Error in accessom2_deinit: atm and ice models are out of sync.
atm end date: 1959-04-01T00:00:00.000
ice end date: 1960-03-31T00:00:00.000

Ran at this location:

/home/502/aph502/scratch_v45/scratch/my-test-run

This suggests this is not a solved problem.

@aekiss
Contributor Author

aekiss commented Dec 19, 2019

sounds like we should revisit this fix:
COSIMA/cice5#39
COSIMA/cice5#38

@aidanheerdegen
Contributor

ping @nichannah

@nichannah
Contributor

Thanks @aidanheerdegen, will take a look.

@nichannah
Contributor

Fixed. There was a bug when the exp date was a leap day and the forcing date was not: we repeat the forcing for the whole day ... it was not skipping the 'out of sync' check properly during this day.

@access-hive-bot

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/access-om2-control-runs/258/4
