-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel failing #69
Comments
The test SMS_Lh3_D.C96.GFSv15p2.cheyenne_gnu passes - we used the same input files generated by chgres_cube from this test in the intel test and it still fails indicating that this is perhaps a model issue and not a chgres_cube issue. This test also fails on stampede at: |
@pjpegion, @climbfuji @mark-a-potts @llpcarson Can you take a look and see what is happening here? |
I'm looking into it. |
@pjpegion Just for your information, I placed a print statement
It seems that the operation is protected to divide to zero error but it fails anyway. The values for |
@uturuncoglu are you getting this tracback on cheyenne or stampede? I'm getting something much more cryptic on cheyenne: |
It is on Stampede but when I run the model again I got following. So, i think, it is not predictable.
|
If you have access to Stampede, it might help to find the source of the problem. |
I don't have an account there, so not sure how much help I can be. I will do what I can on cheyenne since the model fails there in debug mode. |
What about the FV3 if you compile and run it outside of the CIME? Is it failing with the same way in debug mode? |
I ran it outside of CIME and I get the same error. (also ran the debug executable in the directory of a successful run and it fails, which points to the model and nothing in the run setup) |
Adding @junwang-noaa @DusanJovic-NOAA and @climbfuji so that they are aware of this issue |
Recommend looking at compiler options and, more likely, initial conditions. The test runs fine with the regression test input data (i.e. using rt.sh) on Cheyenne (GNU, Intel) and Hera (Intel) in PROD, REPRO and DEBUG mode. If I find time I will take a look |
@uturuncoglu compiling outside of CIME with Debug on the model runs to completion. |
@pjpegion In cime you can examine the file bld/atm.bldlog.*.gz to see the compiler flags used. |
If you are using Intel compiler all compiler flags are set in cmake/Intel.cmake. For gnu compiler they are set in cmake/GNU.cmake. |
@DusanJovic-NOAA Thanks |
It is better to clarify that the executable created outside of CIME is failing with CIME generated (using chgres) initial condition or not? We had a problem with chgres before (see #58) and it is fixed in NCEPLIBS side but there might be still issue related with chgres. |
@uturuncoglu I can run the SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel case with the model compiled outside of CIME. |
@DusanJovic-NOAA The cime buid does not use the flags in those files. Cime compiler flags are defined in cime/config/ufs/machines/config_compilers.xml |
@jedwards4b Thanks. I didn't know that. I wonder how those flags are passed from CIME to ufs-weather-model's cmake build. |
@DusanJovic-NOAA It's a little convoluted: A file is created in the case directory called Macros.cmake There is also a file in FV3/cime/cime_config/configure_cime.cmake that includes that Macros file and translates the variable names as set in cime to those expected by ufs_weather_model. That configure_cime.cmake file is copied to the src/model/cmake directory and used by the model cmake build. |
I did find a difference in that cime is using the mkl library while the noaa build is not, but I turned off mkl in my sandbox and rebuilt - it still fails in the same way. |
@DusanJovic-NOAA It may also be of interest to note that the ccpp physics package ignores any flags set by cime or by ufs-weather-model and sets it's own. |
Yes. This is a "known issue"/"feature". The flags have been set such that the ccpp-physics code gives b4b identical results with the previous IPD physics code (in what we called REPRO mode) or, more generally, that the ccpp-physics are compiled with exactly the same flags as the previous IPD physics code (in DEBUG, REPRO and PROD mode). If the CIME flags are different, then they are most likely incorrect because not tested/vetted by the ufs-weather-model. If we need to accommodate other SIMD instruction sets, then please let us know and we will make this work. |
I think I've solved the problem. I had to remove the debug flags -ftrapuv -fpe0 . I submit that this does not indicate that the CIME flags are incorrect, rather it indicates that there are questionable floating point values in the model and removing these flags avoids trapping them. I'll run the full set of tests overnight and update the issue in the morning. |
Wohoo. I take your point, but please note that the DEBUG flags we use (see https://github.com/ufs-community/ufs-weather-model/blob/52795b83f0febae0fe030d5cb1da3e5bbafba5e8/cmake/Intel.cmake#L36 for the develop branch, and https://github.com/ufs-community/ufs-weather-model/blob/2487a7b9736b516b5c1faba6f4f88bf3e7b82053/cmake/Intel.cmake#L36 for the ufs_public_release branch) do contain "-ftrapuv -fpe0". And the regression tests for GFS_v15p2 and GFS_v16beta do pass in DEBUG mode (for 24h forecasts), see https://github.com/ufs-community/ufs-weather-model/blob/ufs_public_release/tests/rt.conf for the regression testing config. Does this mean the ball is back in the "initital conditions" court? Is it possible for you to use the initial conditions we use for the regression tests (i.e. bypass chgres_cube and only run the model using those ICs)? |
@climbfuji I was using compile_cmake.sh with REPRO=Y DEBUG=Y for comparison and I see from your link that REPRO overrides DEBUG so I wasn't getting the ftrapuv and fpe0 flags in your build. So I rebuilt with REPRO=N and your build runs with those flags so I think I'm back to square one. |
But that lead to the solution because I was also setting both flags in the cime build. So now with DEBUG on (and ftrapuv and fpe0 included) and REPRO off the tests are passing. |
Wow. Good job. I thought we had added a guard in compile_cmake.sh that would prevent setting both of them to true. If not, we should do that (and you the same in CIME in case the user can control that). |
@jedwards4b can you let me know what you changed so I can test it? |
Just to check my understanding - in REPRO mode CCPP does not pass with the floating point debug checks on. This indicates that there actually is some underlying floating point issue in that mode, and that implies that it was a preexisting problem with IPD but it was important to reproduce the exact same behavior in CCPP for validation purposes. Is this correct? So, what is the future of the REPRO flag moving forward? Was that something that only needed for a period to validate CCPP? Will future releases remove this option entirely? It is too late to resolve any floating point problems now, so will we list in the "known bugs" that this issue exists and should be expected? Is it also true that with REPRO=off and DEBUG=true that all tests pass? In other words, when CCPP is not forced to reproduce the old IPD behavior, the floating point problems are actually resolved? |
This is way too complicated to reply to in a GitHub issue. Bottom line is that your assumption is not correct. Mixing REPRO and DEBUG flags doesn't make any sense. You can use REPRO=N DEBUG=N (or omit entirely, because these are the defaults) to get the PROD flags, REPRO=Y DEBUG=N (again, can omit DEBUG=N) to get the REPRO flags, or REPRO=N DEBUG=Y (again, can omit REPRO-N) to get the REPRO flags. For each of those three, the tests run and pass (using our regression testing config - and I believe the same is true for CIME, please confirm).
… On Jan 30, 2020, at 8:29 AM, Rocky Dunlap ***@***.***> wrote:
Just to check my understanding - in REPRO mode CCPP does not pass with the floating point debug checks on. This indicates that there actually is some underlying floating point issue in that mode, and that implies that it was a preexisting problem with IPD and so it was important to reproduce the exact same behavior in CCPP for validation purposes. Is this correct?
So, what is the future of the REPRO flag moving forward? Was that something that only needed for a period to validate CCPP? Will future releases remove this option entirely?
It is too late to resolve any floating point problems now, so will we list in the "known bugs" that this issue exists and should be expected?
Is it also true that with REPRO=off and DEBUG=true that all tests pass? In other words, when CCPP is not forced to reproduce the old IPD behavior, the floating point problems are actually resolved?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#69?email_source=notifications&email_token=AB5C2RIMNKYU2TGFKIYB6ILRALW5VA5CNFSM4KK4IIDKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKLMUJI#issuecomment-580307493>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB5C2RL5KVW7YIORS77UGHLRALW5VANCNFSM4KK4IIDA>.
|
@pjpegion I'll update the ufs_mrweather and let you know. @rsdunlapiv what @climbfuji says is correct - but I believe that to have ccpp set it's own flags independent of the flags set for ufs_weather or cime is a problem. We need to be able to build the entire model from a consistent set of compiler flags defined in a central location. |
@climbfuji thanks for clarifying - I guess since it is a complex issue, the bigger picture question is what does the end user need to be aware of and what is considered a technical detail to be managed by the workflow and build teams? In other words, does anyone really need to know about the REPRO/DEBUG combinations at the user level? If so then we'd want to try to document the details in a understandable way. But, if this is really a esoteric thing, maybe we just make sure the flags are consistent whether they are set through CIME or the model build - but the user really doesn't need to mess with it. Thoughts? |
First thing I will do is to check if there is a guard in compile_cmake.sh or not. If both DEBUG=Y and REPRO=Y are set, the script should return an error and not overwrite one or the other. I think the user needs to know about DEBUG=Y/N, but not about REPRO (this is only for testing CCPP against IPD). |
@climbfuji glad to hear that REPRO is not user facing (I think it would be hard to explain this to a general audience). So, REPRO will be handled internally. I agree that DEBUG mode is a user-facing option and they should be aware of how to activate it. |
Users also do not need to know anything about two compile scripts in the tests directory. Those scripts are internal to regression test and must be left undocumented. We will be changing them as we need to support various regression test requirements. The only supported way of building ufs-weather-model is build.sh script in the top-level directory. Which is what is documented here: |
Let's close the issue once the guard has been added to compile_cmake.sh. |
@jedwards4b is this ticket closed now? |
Yes |
We thought that #58 was solved with an update to chgres_cube, however the same test is still
failing with 3 different tracebacks:
The text was updated successfully, but these errors were encountered: