
MPAS-seaice Error, Potential CFL violation in IR advection for T62_ECwISC30to60E2r1.GMPAS-DIB-IAF-ISMF #5536

Closed
ndkeen opened this issue Mar 20, 2023 · 5 comments · Fixed by #5564
Labels: mpas-seaice, GCP (google cloud platform)

ndkeen (Contributor) commented Mar 20, 2023

On gcp12, with GNU, we see the error below.
It was passing on this machine on March 14th and started failing on March 16th.

Impacts both of these tests in extra_coverage:

ERS_P480_Ld5.T62_ECwISC30to60E2r1.GMPAS-DIB-IAF-ISMF.gcp12_gnu
PEM_P480_Ld5.T62_ECwISC30to60E2r1.GMPAS-DIB-IAF-ISMF
cat /home/jason_sarich/e3sm/nightly_tests/e3sm_extra_coverage/ERS_P480_Ld5.T62_ECwISC30to60E2r1.GMPAS-DIB-IAF-ISMF.gcp12_gnu.C.20230320_023157/run/log.seaice.0346.err
----------------------------------------------------------------------
Beginning MPAS-seaice Error Log File for task     346 of     480
    Opened at 2023/03/20 02:40:46
----------------------------------------------------------------------

ERROR:  Potential CFL violation in IR advection, global vertex ID =      288587
ERROR:  Speed at vertex =   61.091939783692659
ERROR:  Maximum safe speed =   16.376000007751468
ERROR:  Potential CFL violation in IR advection, global vertex ID =       40506
ERROR:  Speed at vertex =   58.750092192842729
ERROR:  Maximum safe speed =   16.158747700557296
ERROR:  Potential CFL violation in IR advection, global vertex ID =      224156
ERROR:  Speed at vertex =   58.309477374528683
ERROR:  Maximum safe speed =   16.158747700557296
ERROR:  Potential CFL violation in IR advection, global vertex ID =      278795
ERROR:  Speed at vertex =   29.503025613854842
ERROR:  Maximum safe speed =   15.677157467658768
ERROR:  Potential CFL violation in IR advection, global vertex ID =       77294
ERROR:  Speed at vertex =   30.668745456468578
ERROR:  Maximum safe speed =   15.657828305964950
ERROR:  Potential CFL violation in IR advection, global vertex ID =       54209
ERROR:  Speed at vertex =   58.686322639600341
ERROR:  Maximum safe speed =   15.713337177136138
ERROR:  Potential CFL violation in IR advection, global vertex ID =      465451
ERROR:  Speed at vertex =   57.185852442708253
ERROR:  Maximum safe speed =   15.928740160409692
ERROR:  Potential CFL violation in IR advection, global vertex ID =      440863
ERROR:  Speed at vertex =   56.274645169223319
ERROR:  Maximum safe speed =   15.928740160409692
ERROR: IR: Negative mass in IR: iCat, iCell, global iCell, value: 5 107 52621 -10.3008912216303
CRITICAL ERROR: Runtime error
Logging complete.  Closing file at 2023/03/20 02:40:50
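For context, the "maximum safe speed" in the incremental-remapping (IR) check is consistent with a shortest-local-edge-length-over-timestep bound. A minimal sketch of that kind of check (not the actual MPAS-seaice implementation; the names `dx_min` and `dt` are hypothetical, and the 3600 s timestep is an assumption that happens to reproduce the ~16.376 m/s bound in the log):

```python
def check_cfl(speed, dx_min, dt, safety=1.0):
    """Hypothetical sketch of a CFL-style advection check: in one
    timestep, material moving at `speed` must not travel farther than
    the shortest local edge length `dx_min` (SI units throughout)."""
    max_safe_speed = safety * dx_min / dt
    return speed <= max_safe_speed, max_safe_speed

# Assuming a 3600 s timestep, a ~59 km edge gives the ~16.376 m/s bound
# reported in the log; the 61.09 m/s vertex speed clearly violates it.
ok, max_safe = check_cfl(61.091939783692659, 58953.6, 3600.0)
```

This only illustrates the inequality the error message reports; the real check lives in the MPAS-seaice IR advection code.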
ndkeen added the mpas-seaice and GCP (google cloud platform) labels on Mar 20, 2023
rljacob (Member) commented Mar 20, 2023

See if you can reproduce on perlmutter with gnu.

ndkeen (Contributor, author) commented Mar 20, 2023

Yes, it also fails on pm-cpu and chrysalis (with gnu).
Can reproduce with SMS_P480_Ld5.T62_ECwISC30to60E2r1.GMPAS-DIB-IAF-ISMF.pm-cpu_gnu
or SMS_P480_Ld5.T62_ECwISC30to60E2r1.GMPAS-DIB-IAF-ISMF.chrysalis_gnu

Same sort of error as above when looking at files such as:

-rw-rw-r--  1 ndk ndk       799 Mar 20 11:24  log.seaice.0347.d0544.err
-rw-rw-r--  1 ndk ndk      1865 Mar 20 11:24  log.seaice.0346.d0544.err
-rw-rw-r--  1 ndk ndk      1684 Mar 20 11:24 'log.seaice.0345.d****.err'
-rwxrwxr-t  1 ndk ndk    930472 Mar 20 11:24  abort_seaice_0001-01-01_00.00.00_block_346.nc*
-rwxrwxr-t  1 ndk ndk 312006016 Mar 20 11:24  abort_seaice_0001-01-01_00.00.00.nc*

The failure on chrysalis:
/lcrc/group/e3sm/ac.ndkeen/scratch/chrys/m33-mar21/SMS_P480_Ld5.T62_ECwISC30to60E2r1.GMPAS-DIB-IAF-ISMF.chrysalis_gnu.20230321_152829_sepvu4

I see that SMS_D_P480_Ln6.T62_ECwISC30to60E2r1.GMPAS-DIB-IAF-ISMF.pm-cpu_gnu passed.

I started removing optimization flags, but even with all of them removed for all compiles, the OPT build still hits this error. Conversely, adding -O2 to a DEBUG build makes it fail the same way.

ndkeen (Contributor, author) commented Mar 22, 2023

OK, I think there is a floating-point issue that occurs with optimization and can be caught with the compiler's floating-point exception flags. Building with OPT and adding those flags, I get:

224: #0  0x15020e410d4f in ???
224: #1  0x15020ec99260 in ???
224: #2  0x15020ec6661e in ???
224: #3  0x2658650 in __water_isotopes_MOD_wiso_alpl
224:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-mar30/share/util/water_isotopes.F90:350
224: #4  0x2657782 in __water_isotopes_MOD_wiso_flxoce
224:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-mar30/share/util/water_isotopes.F90:553
224: #5  0x2531a68 in __shr_flux_mod_MOD_shr_flux_atmocn
224:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-mar30/share/util/shr_flux_mod.F90:466
224: #6  0x53f87e in __seq_flux_mct_MOD_seq_flux_atmocn_mct
224:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-mar30/driver-mct/main/seq_flux_mct.F90:1621
224: #7  0x414c2c in cime_run_atmocn_fluxes
224:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-mar30/driver-mct/main/cime_comp_mod.F90:3775
224: #8  0x412e35 in cime_run_atmocn_setup
224:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-mar30/driver-mct/main/cime_comp_mod.F90:4094
224: #9  0x41d03d in __cime_comp_mod_MOD_cime_run
224:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-mar30/driver-mct/main/cime_comp_mod.F90:2832
224: #10  0x436e7a in cime_driver
224:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-mar30/driver-mct/main/cime_driver.F90:153
224: #11  0x436edd in main
224:    at /global/cfs/cdirs/e3sm/ndk/repos/nexty-mar30/driver-mct/main/cime_driver.F90:23

in share/util/water_isotopes.F90:

!Horita and Wesolowski, 1994:
    if (isp == isphdo) then ! HDO has a different formulation:
       write(*,'(a,i10,a,es20.8,es20.8,es20.8,es20.8,es20.8,a,es20.8)') "ndk isp=", isp, " alpal(isp)=", &
            alpal(isp), alpbl(isp), alpcl(isp), alpdl(isp), alpel(isp), " tk=", tk
       write(*,'(a,es20.8)') "ndk value=", alpal(isp)*tk**3 + alpbl(isp)*tk**2 + alpcl(isp)*tk + alpdl(isp) + alpel(isp)/tk**3
       wiso_alpl = exp(alpal(isp)*tk**3 + alpbl(isp)*tk**2 + alpcl(isp)*tk + alpdl(isp) + alpel(isp)/tk**3)  ! <-- error here
    else
       wiso_alpl = exp(alpal(isp)/tk**3 + alpbl(isp)/tk**2 + alpcl(isp)/tk + alpdl(isp))
    end if
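To see how this expression behaves, here is a Python transcription of the argument passed to exp() in the HDO branch, using the coefficient values printed by the debug output below (a sketch for illustration only; A through E stand in for alpal(isp) through alpel(isp)):

```python
# Horita and Wesolowski (1994) HDO liquid-vapor fractionation:
# coefficient values as printed by the debug output for isp = isphdo.
A, B, C, D, E = 1.1588e-9, -1.6201e-6, 7.9484e-4, -0.16104, 2.9992e6

def hdo_exp_arg(tk):
    """Argument of exp() in the HDO branch of wiso_alpl, tk in Kelvin."""
    return A * tk**3 + B * tk**2 + C * tk + D + E / tk**3

arg_ok  = hdo_exp_arg(302.128618)    # physical SST: ~7.19e-2
arg_bad = hdo_exp_arg(1.09298113e6)  # corrupted tk: ~1.5e9, overflows exp()
```

Because of the cubic term A*tk**3, the argument grows like tk**3, so a tk of ~1e6 instead of ~3e2 inflates it by roughly ten orders of magnitude.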

and then rebuilding with some debug prints:

224: ndk isp=         3 alpal(isp)=      1.15880000E-09     -1.62010000E-06      7.94840000E-04     -1.61040000E-01      2.99920000E+06 tk=      3.02128618E+02
224: ndk value=      7.19268983E-02
224: ndk isp=         3 alpal(isp)=      1.15880000E-09     -1.62010000E-06      7.94840000E-04     -1.61040000E-01      2.99920000E+06 tk=      3.02103952E+02
224: ndk value=      7.19502518E-02
224: ndk isp=         3 alpal(isp)=      1.15880000E-09     -1.62010000E-06      7.94840000E-04     -1.61040000E-01      2.99920000E+06 tk=      3.02089799E+02
224: ndk value=      7.19636555E-02
224: ndk isp=         3 alpal(isp)=      1.15880000E-09     -1.62010000E-06      7.94840000E-04     -1.61040000E-01      2.99920000E+06 tk=      3.02126360E+02
224: ndk value=      7.19290357E-02
224: ndk isp=         3 alpal(isp)=      1.15880000E-09     -1.62010000E-06      7.94840000E-04     -1.61040000E-01      2.99920000E+06 tk=      2.98804598E+02
224: ndk value=      7.51476143E-02
224: ndk isp=         3 alpal(isp)=      1.15880000E-09     -1.62010000E-06      7.94840000E-04     -1.61040000E-01      2.99920000E+06 tk=      3.01213824E+02
224: ndk value=      7.27983995E-02
224: ndk isp=         3 alpal(isp)=      1.15880000E-09     -1.62010000E-06      7.94840000E-04     -1.61040000E-01      2.99920000E+06 tk=      1.09298113E+06
224: ndk value=      1.51109178E+09

So the value of tk shoots up to ~1e6, which pushes the argument of exp() up to ~1.5e9.
And tk is a temperature in Kelvin, so that is clearly not right.
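For reference, exp() in IEEE double precision overflows for any argument above log(DBL_MAX), roughly 709.78, so an argument of ~1.5e9 fails immediately. A quick Python check of both facts:

```python
import math
import sys

# Largest argument exp() can take without overflowing a double:
max_exp_arg = math.log(sys.float_info.max)  # ~709.78

try:
    math.exp(1.51109178e9)  # the argument seen in the debug output
    overflowed = False
except OverflowError:
    overflowed = True       # Python raises; Fortran would signal an FP exception
```

This is why the FP-exception flags catch the failure right at the exp() call.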

ndkeen (Contributor, author) commented Mar 24, 2023

After debug printing, I see that tocn or ts array does end up with some rather odd values. I've tracked it at least up to here:

In driver-mct/main/seq_flux_mct.F90

       write(*,'(a,i10,a,i10,a,i10)') " ndk calling shr_flux_atmocn nloc=", nloc, " nloca=", nloca, " nloc_a2o=", nloc_a2o
       do n = 1, nloc
          if (tocn(n) .gt. 500.0) then
             write(*,'(a,es20.8,i10,i2)') " ndk calling shr_flux_atmocn with tocn(n)=", tocn(n), n, mask(n)
          endif
       enddo
       call shr_flux_atmocn (nloc , zbot , ubot, vbot, thbot, &
            shum , shum_16O , shum_HDO, shum_18O, dens , tbot, uocn, vocn , &
            tocn , emask, seq_flux_atmocn_minwind, &
            sen , lat , lwup , &
            roce_16O, roce_HDO, roce_18O,    &
            evap , evap_16O, evap_HDO, evap_18O, taux , tauy, tref, qref , &
            ocn_surface_flux_scheme, &
            duu10n,ustar, re  , ssq, &
            wsresp=wsresp, tau_est=tau_est, ugust=ugust_atm)
This prints:

 66:  ndk calling shr_flux_atmocn nloc=       576 nloca=       576 nloc_a2o=         0
 66:  ndk calling shr_flux_atmocn with tocn(n)=      1.04022003E+06       113 1
 66:  ndk calling shr_flux_atmocn with tocn(n)=      1.12726246E+06       114 1
 66:  ndk calling shr_flux_atmocn with tocn(n)=      9.13794760E+05       116 1
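The debug loop above can be mirrored as a standalone sanity check (a sketch only; the 500 K threshold is the same ad hoc bound used in the Fortran print, and the function name is hypothetical):

```python
def flag_bad_sst(tocn, mask, tmax=500.0):
    """Return 1-based indices and values of masked-in sea-surface
    temperatures (Kelvin) that exceed a plausibility threshold."""
    return [(n, t)
            for n, (t, m) in enumerate(zip(tocn, mask), start=1)
            if m == 1 and t > tmax]

# The corrupted values reported above stand out immediately:
bad = flag_bad_sst([290.0, 1.04022003e6, 295.0], [1, 1, 1])
```

Any hit from a check like this means the corruption happened upstream of shr_flux_atmocn, which is why the next step is tracing where tocn is filled.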

jonbob (Contributor) commented Mar 24, 2023

I'm also seeing similar failures on anvil and chrysalis, though not in five-day runs but more like near the end of year 1. I did some digging and it turns out PR #5254, which was known to be non-BFB, has unintended impacts on cryo configurations. That PR was merged to master on March 15, so this fits into the timeline. We're planning on reverting that PR on Monday unless we come up with a fix before then.

jonbob added a commit referencing this issue on Mar 28, 2023:
Revert z-star PR

Reverts PR #5254, which modified ocean z-star ALE coordinate for
inactive top cells. More testing shows it causes problems for cryo
configurations with ice shelf cavities.

Fixes #5536

[non-BFB]
jonbob closed this as completed in c822da9 on Mar 29, 2023.