Test SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel failing #69

Closed
jedwards4b opened this issue Jan 23, 2020 · 42 comments

@jedwards4b
Collaborator

We thought that #58 was solved with an update to chgres_cube; however, the same test is still
failing with three different tracebacks:

14:MPT: #6  gfdl_cloud_microphys::gfdl_cloud_microphys_run (
14:MPT:     levs=<error reading variable: Cannot access memory at address 0x3>, 
14:MPT:     im=<error reading variable: Cannot access memory at address 0x0>, con_g=0, 
14:MPT:     con_fvirt=0, con_rd=0, frland=..., garea=..., islmsk=..., gq0=..., 
14:MPT:     gq0_ntcw=..., gq0_ntrw=..., gq0_ntiw=..., gq0_ntsw=..., gq0_ntgl=..., 
14:MPT:     gq0_ntclamt=..., gt0=..., gu0=..., gv0=..., vvl=..., prsl=..., phii=..., 
14:MPT:     del=..., rain0=..., ice0=..., snow0=..., graupel0=..., prcp0=..., sr=..., 
14:MPT:     dtp=450, hydrostatic=.FALSE., phys_hydrostatic=4294967295, lradar=.FALSE., 
14:MPT:     refl_10cm=..., reset=4294967295, effr_in=4294967295, rew=..., rei=..., 
14:MPT:     rer=..., res=..., reg=..., errmsg=..., errflg=0, .tmp.ERRMSG.len_V$12a=512)
14:MPT:     at /glade/scratch/jedwards/SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel.20200123_114400_ntheci/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_cloud_microphys.F90:263
94:MPT: #6  hedmf::hedmf_run (ix=959008295, 
94:MPT:     im=<error reading variable: Cannot access memory at address 0x2>, km=-1, 
94:MPT:     ntrac=<error reading variable: Cannot access memory at address 0x14>, 
94:MPT:     ntcw=973450256, dv=..., du=..., tau=..., rtg=..., u1=..., v1=..., t1=..., 
94:MPT:     q1=..., swh=..., hlw=..., xmu=..., psk=..., rbsoil=..., zorl=..., 
94:MPT:     u10m=..., v10m=..., fm=..., fh=..., tsea=..., heat=..., evap=..., 
94:MPT:     stress=..., spd1=..., kpbl=..., prsi=..., del=..., prsl=..., prslk=..., 
94:MPT:     phii=..., phil=..., delt=450, dspheat=4294967295, dusfc=..., dvsfc=..., 
94:MPT:     dtsfc=..., dqsfc=..., hpbl=..., hgamt=..., hgamq=..., dkt=..., kinver=..., 
94:MPT:     xkzm_m=1, xkzm_h=1, xkzm_s=1, lprnt=.FALSE., ipr=10, 
94:MPT:     xkzminv=0.29999999999999999, moninq_fac=1, errmsg=..., errflg=0, 
94:MPT:     .tmp.ERRMSG.len_V$f8=512)
94:MPT:     at /glade/scratch/jedwards/SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel.20200123_114400_ntheci/bld/atm/obj/FV3/ccpp/physics/physics/moninedmf.f:511
41:MPT: #6  0x0000000002e1371e in module_radiation_astronomy::coszmn (xlon=..., 
41:MPT:     sinlat=<error reading variable: Cannot access memory at address 0x60>, 
41:MPT:     coslat=<error reading variable: Cannot access memory at address 0x60>, 
41:MPT:     solhr=<error reading variable: Cannot access memory at address 0x12>, 
41:MPT:     im=<error reading variable: Cannot access memory at address 0x8>, 
41:MPT:     me=<error reading variable: Cannot access memory at address 0x8>, 
41:MPT:     coszen=..., coszdg=...)
41:MPT:     at /glade/scratch/jedwards/SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel.20200123_114400_ntheci/bld/atm/obj/FV3/ccpp/physics/physics/radiation_astronomy.f:901
@jedwards4b
Collaborator Author

The test SMS_Lh3_D.C96.GFSv15p2.cheyenne_gnu passes. We used the same input files generated by chgres_cube for that test in the intel test and it still fails, which indicates that this is perhaps a model issue and not a chgres_cube issue. This test also fails on stampede, at line 901 of radiation_astronomy.f (module_radiation_astronomy).

@arunchawla-NOAA
Collaborator

@pjpegion, @climbfuji @mark-a-potts @llpcarson

Can you take a look and see what is happening here?

@pjpegion
Collaborator

I'm looking into it.

@uturuncoglu
Collaborator

@pjpegion Just for your information, I placed a print statement in FV3/ccpp/physics/physics/radiation_astronomy.f because it was giving the following error:

forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
ufs.exe            0000000004E0E53F  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B8BDADEA5D0  Unknown               Unknown  Unknown
ufs.exe            0000000002CBEF5E  module_radiation_         901  radiation_astronomy.f
ufs.exe            0000000002B8F005  gfs_rrtmg_pre_mp_         319  GFS_rrtmg_pre.F90
ufs.exe            0000000002B043AB  ccpp_fv3_gfs_v16b         112  ccpp_FV3_GFS_v16beta_radiation_cap.F90
ufs.exe            0000000002AF677F  ccpp_static_api_m         147  ccpp_static_api.F90
ufs.exe            0000000002AFC165  ccpp_driver_mp_cc         234  CCPP_driver.F90
ufs.exe            0000000000635A43  atmos_model_mod_m         338  atmos_model.F90
ufs.exe            00000000006292D3  module_fcst_grid_         708  module_fcst_grid_comp.F90

It seems that the operation is guarded against a divide-by-zero error, but it fails anyway. The values of istsun vary between 0 and 8.
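For reference, the division in coszmn is typically the average of the accumulated cosine over the sunlit sub-steps counted by istsun. The sketch below is only an illustration of that pattern (made-up program name and values, not the actual radiation_astronomy.f source); it shows why a trap despite an istsun > 0 guard suggests the accumulated values themselves are already invalid:

```fortran
! Illustrative sketch only, not the actual radiation_astronomy.f code.
! The istsun(i) > 0 test guards the division itself, but if the accumulated
! coszen(i) is already NaN/Inf (e.g. from bad initial conditions), -fpe0
! will still trap on this line.
program cosz_guard_sketch
  implicit none
  integer, parameter :: im = 4, nstp = 8
  real    :: coszen(im), coszdg(im), rstp
  integer :: istsun(im), i

  coszen = (/ 3.2, 0.0, 1.7, 5.5 /)   ! accumulated sums (made-up values)
  istsun = (/ 6, 0, 3, 8 /)            ! number of sunlit sub-steps
  rstp   = 1.0 / real(nstp)

  do i = 1, im
    coszdg(i) = coszen(i) * rstp              ! mean over all sub-steps
    if (istsun(i) > 0) then
      coszen(i) = coszen(i) / real(istsun(i)) ! mean over sunlit steps only
    end if
  end do

  print *, 'coszen =', coszen
end program cosz_guard_sketch
```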

@pjpegion
Collaborator

@uturuncoglu are you getting this traceback on cheyenne or stampede? I'm getting something much more cryptic on cheyenne:
MPT: header=header@entry=0x7ffde901cc00 "MPT ERROR: Rank 95(g:95) received signal SIGFPE(8).\n\tProcess ID: 6654, Host: r2i4n15, Program: /glade/scratch/pegion/SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel.try/bld/ufs.exe\n\tMPT Version: HPE MPT 2.19 0"...) at sig.c:340"

@uturuncoglu
Collaborator

It is on Stampede, but when I ran the model again I got the following. So, I think, it is not deterministic.

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
ufs.exe            0000000004E0E52F  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B76678E75D0  Unknown               Unknown  Unknown
ufs.exe            0000000002BF5AC5  satmedmfvdifq_mp_         624  satmedmfvdifq.F
ufs.exe            0000000002B3B6FB  ccpp_fv3_gfs_v16b         974  ccpp_FV3_GFS_v16beta_physics_cap.F90
ufs.exe            0000000002AF6A19  ccpp_static_api_m         150  ccpp_static_api.F90
ufs.exe            0000000002AFC165  ccpp_driver_mp_cc         234  CCPP_driver.F90
ufs.exe            0000000000635F79  atmos_model_mod_m         364  atmos_model.F90
ufs.exe            00000000006292D3  module_fcst_grid_         708  module_fcst_grid_comp.F90
libesmf.so         00002B76638FA181  _ZN5ESMCI6FTable1     Unknown  Unknown
libesmf.so         00002B76638FDA8F  ESMCI_FTableCallE     Unknown  Unknown
libesmf.so         00002B7663DC1165  _ZN5ESMCI2VM5ente     Unknown  Unknown
libesmf.so         00002B76638FB64A  c_esmc_ftablecall     Unknown  Unknown
libesmf.so         00002B7663FEA41D  esmf_compmod_mp_e     Unknown  Unknown
libesmf.so         00002B76641DDEFF  esmf_gridcompmod_     Unknown  Unknown
ufs.exe            0000000000606419  fv3gfs_cap_mod_mp         999  fv3_cap.F90

@uturuncoglu
Collaborator

If you have access to Stampede, it might help to find the source of the problem.

@pjpegion
Collaborator

I don't have an account there, so I'm not sure how much help I can be. I will do what I can on cheyenne, since the model fails there in debug mode.

@uturuncoglu
Collaborator

What about FV3 if you compile and run it outside of CIME? Does it fail in the same way in debug mode?

@pjpegion
Collaborator

I ran it outside of CIME and I get the same error. (I also ran the debug executable in the directory of a successful run and it fails, which points to the model and not to anything in the run setup.)
I haven't tried to compile it outside of CIME yet. I will try that next.

@arunchawla-NOAA
Collaborator

Adding @junwang-noaa @DusanJovic-NOAA and @climbfuji so that they are aware of this issue

@climbfuji
Collaborator

Recommend looking at compiler options and, more likely, initial conditions. The test runs fine with the regression test input data (i.e. using rt.sh) on Cheyenne (GNU, Intel) and Hera (Intel) in PROD, REPRO and DEBUG mode.

If I find time I will take a look

@pjpegion
Collaborator

@uturuncoglu compiling outside of CIME with Debug on, the model runs to completion.
@climbfuji can you tell me where in CIME the compiler flags are set?
Thanks.

@jedwards4b
Collaborator Author

@pjpegion In CIME you can examine the file bld/atm.bldlog.*.gz to see the compiler flags used.
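For example, something along these lines (the exact log file name depends on the build timestamp, and the compiler-wrapper pattern is only a guess) will show the Fortran compile lines and their flags:

```sh
# List the Fortran compile commands (and flags) recorded in the gzipped
# CIME build log; adjust the pattern if a different compiler wrapper is used.
zcat bld/atm.bldlog.*.gz | grep -E 'ifort|mpif90' | head
```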
When you compile outside of CIME, are you using the same initial conditions as those generated in the CIME case?

@DusanJovic-NOAA
Collaborator

If you are using the Intel compiler, all compiler flags are set in cmake/Intel.cmake. For the GNU compiler they are set in cmake/GNU.cmake.

@pjpegion
Collaborator

@DusanJovic-NOAA Thanks

@uturuncoglu
Collaborator

uturuncoglu commented Jan 29, 2020

It would be better to clarify whether or not the executable created outside of CIME fails with the CIME-generated (using chgres) initial conditions. We had a problem with chgres before (see #58) and it was fixed on the NCEPLIBS side, but there might still be an issue related to chgres.

@pjpegion
Collaborator

@uturuncoglu I can run the SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel case with the model compiled outside of CIME.

@jedwards4b
Collaborator Author

@DusanJovic-NOAA The CIME build does not use the flags in those files. CIME compiler flags are defined in cime/config/ufs/machines/config_compilers.xml.

@DusanJovic-NOAA
Collaborator

@jedwards4b Thanks. I didn't know that. I wonder how those flags are passed from CIME to ufs-weather-model's cmake build.

@jedwards4b
Collaborator Author

@DusanJovic-NOAA It's a little convoluted: a file called Macros.cmake is created in the case directory. There is also a file, FV3/cime/cime_config/configure_cime.cmake, that includes that Macros file and translates the variable names as set in CIME to those expected by ufs-weather-model. That configure_cime.cmake file is copied to the src/model/cmake directory and used by the model's cmake build.
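Roughly speaking, the translation layer does something like the following (hypothetical variable names and a made-up CASEROOT variable, only a sketch of the idea; the actual configure_cime.cmake is the authoritative source):

```cmake
# Hypothetical sketch of the CIME -> ufs-weather-model flag hand-off;
# names are illustrative, not copied from the real configure_cime.cmake.

# Macros.cmake is generated in the case directory by CIME and defines
# CIME-style variables such as FFLAGS and CFLAGS.
include(${CASEROOT}/Macros.cmake)

# Translate the CIME names into the names the ufs-weather-model
# CMake build expects.
set(CMAKE_Fortran_FLAGS "${FFLAGS}")
set(CMAKE_C_FLAGS       "${CFLAGS}")
```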

@jedwards4b
Collaborator Author

I did find a difference: CIME is using the MKL library while the NOAA build is not. But I turned off MKL in my sandbox and rebuilt, and it still fails in the same way.

@jedwards4b
Collaborator Author

@DusanJovic-NOAA It may also be of interest to note that the ccpp physics package ignores any flags set by CIME or by ufs-weather-model and sets its own.

@climbfuji
Collaborator

Yes. This is a "known issue"/"feature". The flags have been set such that the ccpp-physics code gives b4b identical results to the previous IPD physics code (in what we called REPRO mode) or, more generally, such that the ccpp-physics are compiled with exactly the same flags as the previous IPD physics code (in DEBUG, REPRO and PROD mode). If the CIME flags are different, then they are most likely incorrect, because they have not been tested/vetted by ufs-weather-model. If we need to accommodate other SIMD instruction sets, then please let us know and we will make this work.

@jedwards4b
Collaborator Author

I think I've solved the problem. I had to remove the debug flags -ftrapuv -fpe0. I submit that this does not indicate that the CIME flags are incorrect; rather, it indicates that there are questionable floating point values in the model, and removing these flags avoids trapping them. I'll run the full set of tests overnight and update the issue in the morning.

@climbfuji
Collaborator

Woohoo. I take your point, but please note that the DEBUG flags we use (see https://github.com/ufs-community/ufs-weather-model/blob/52795b83f0febae0fe030d5cb1da3e5bbafba5e8/cmake/Intel.cmake#L36 for the develop branch, and https://github.com/ufs-community/ufs-weather-model/blob/2487a7b9736b516b5c1faba6f4f88bf3e7b82053/cmake/Intel.cmake#L36 for the ufs_public_release branch) do contain "-ftrapuv -fpe0". And the regression tests for GFS_v15p2 and GFS_v16beta do pass in DEBUG mode (for 24h forecasts); see https://github.com/ufs-community/ufs-weather-model/blob/ufs_public_release/tests/rt.conf for the regression testing config. Does this mean the ball is back in the "initial conditions" court?

Is it possible for you to use the initial conditions we use for the regression tests (i.e. bypass chgres_cube and only run the model using those ICs)?

@jedwards4b
Collaborator Author

@climbfuji I was using compile_cmake.sh with REPRO=Y DEBUG=Y for comparison, and I see from your link that REPRO overrides DEBUG, so I wasn't getting the -ftrapuv and -fpe0 flags in your build. So I rebuilt with REPRO=N, and your build runs with those flags, so I think I'm back to square one.

@jedwards4b
Collaborator Author

But that led to the solution, because I was also setting both flags in the CIME build. So now, with DEBUG on (and -ftrapuv and -fpe0 included) and REPRO off, the tests are passing.
(In CIME the combination had a different effect than in the NOAA build: in the NOAA build combining the flags turned off the debug flags, but in CIME the debug flags were on while ccpp was built with CMAKE_BUILD_TYPE=Repro instead of CMAKE_BUILD_TYPE=DEBUG.)

@climbfuji
Collaborator

Wow. Good job. I thought we had added a guard in compile_cmake.sh that would prevent setting both of them to true. If not, we should do that (and you should do the same in CIME, in case the user can control that).

@pjpegion
Collaborator

@jedwards4b can you let me know what you changed so I can test it?
Thanks, Phil

@rsdunlapiv
Collaborator

rsdunlapiv commented Jan 30, 2020

Just to check my understanding: in REPRO mode CCPP does not pass with the floating point debug checks on. This indicates that there actually is some underlying floating point issue in that mode, and that implies it was a pre-existing problem with IPD, but it was important to reproduce the exact same behavior in CCPP for validation purposes. Is this correct?

So, what is the future of the REPRO flag moving forward? Was it something that was only needed for a period of time to validate CCPP? Will future releases remove this option entirely?

It is too late to resolve any floating point problems now, so will we note in the "known bugs" list that this issue exists and should be expected?

Is it also true that with REPRO=off and DEBUG=true all tests pass? In other words, when CCPP is not forced to reproduce the old IPD behavior, are the floating point problems actually resolved?

@climbfuji
Collaborator

climbfuji commented Jan 30, 2020 via email

@jedwards4b
Collaborator Author

@pjpegion I'll update ufs-mrweather-app and let you know.

@rsdunlapiv what @climbfuji says is correct, but I believe that having ccpp set its own flags independent of the flags set for ufs-weather-model or CIME is a problem. We need to be able to build the entire model from a consistent set of compiler flags defined in a central location.

@rsdunlapiv
Collaborator

@climbfuji thanks for clarifying. I guess since it is a complex issue, the bigger-picture question is: what does the end user need to be aware of, and what is considered a technical detail to be managed by the workflow and build teams? In other words, does anyone really need to know about the REPRO/DEBUG combinations at the user level? If so, then we'd want to try to document the details in an understandable way. But if this is really an esoteric thing, maybe we just make sure the flags are consistent whether they are set through CIME or the model build, and the user really doesn't need to mess with it. Thoughts?

@climbfuji
Collaborator

The first thing I will do is check whether or not there is a guard in compile_cmake.sh. If both DEBUG=Y and REPRO=Y are set, the script should return an error rather than silently overriding one with the other. I think the user needs to know about DEBUG=Y/N, but not about REPRO (this is only for testing CCPP against IPD).
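A minimal sketch of such a guard, assuming the script reads DEBUG and REPRO as Y/N shell variables (an assumption; the real compile_cmake.sh may differ), could be:

```sh
# Hypothetical guard for compile_cmake.sh: refuse the ambiguous combination
# instead of silently letting one option override the other.
if [ "${DEBUG}" = "Y" ] && [ "${REPRO}" = "Y" ]; then
  echo "ERROR: DEBUG=Y and REPRO=Y are mutually exclusive; choose one." >&2
  exit 1
fi
```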

@rsdunlapiv
Collaborator

@climbfuji glad to hear that REPRO is not user-facing (I think it would be hard to explain this to a general audience). So, REPRO will be handled internally. I agree that DEBUG mode is a user-facing option and users should be aware of how to activate it.

@DusanJovic-NOAA
Collaborator

Users also do not need to know anything about the two compile scripts in the tests directory. Those scripts are internal to the regression tests and must be left undocumented; we will be changing them as needed to support various regression test requirements. The only supported way of building ufs-weather-model is the build.sh script in the top-level directory, which is what is documented here:

https://ufs-mr-weather-app.readthedocs.io/projects/ufs-weather-model/en/latest/CompilingCodeWithoutApp.html

@climbfuji
Collaborator

Let's close the issue once the guard has been added to compile_cmake.sh.

@ufs-community ufs-community deleted a comment from jedwards4b Jan 30, 2020
@jedwards4b
Collaborator Author

All tests are now passing on cheyenne with intel and gnu. @pjpegion if you would like to test again, the head of ufs-mrweather-app master (hash c21d286) has all the externals up to date.

@jedwards4b
Collaborator Author

@pjpegion I made a mistake in updating ufs-mrweather-app; the corrected hash is 49f3b54.

@arunchawla-NOAA
Collaborator

@jedwards4b is this ticket closed now?

@jedwards4b
Collaborator Author

Yes
