nvhpc compiler tests are failing on cheyenne/derecho
#1733
In the CESM3_dev branch two of the tests now PASS:

SMS.f10_f10_mg37.I2000Clm50BgcCrop.cheyenne_nvhpc.clm-crop (FAILED PREVIOUSLY)

While this one still fails, but now with a floating point exception:

SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop (EXPECTED)

The cesm.log file shows that there is a problem in ESMF at initialization, in creating an ESMF mesh. It doesn't drop PET files by default in this case... cesm.log:
Seeing similar errors on Derecho: These PASS: These FAIL: The failures are all in the build now, with an error message from the FATES code like this:
Looking at the code I don't see an obvious problem. I googled it and there are some NVIDIA nvhpc reports about this kind of error, but it's not obvious what the issue is here or how to fix it.
A reminder that nvhpc is important for the flexibility to start using GPUs, and since Derecho has NVIDIA GPUs, nvhpc is going to be the most performant compiler for Derecho's GPUs. Even though GPUs don't currently look important for most uses of CTSM, they will be important for ultra-high resolution. And as hardware changes in the future, it's important to have flexibility in the model to take advantage of different types of hardware, in order to keep the model working well.
Corrected that Derecho has NVIDIA GPUs. From talking with @sherimickelson and from slides presented by her group at the Sep 12, 2023 CSEG meeting: the nvhpc and cray compilers work for the Derecho GPUs, but intel-oneapi didn't at the time.
We talked about this in the CSEG meeting. The takeaways are:
This is great news and thanks, @ekluzek, for sharing this and for your support.
In what will be ctsm5.3.014 the test SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop now passes the BUILD phase and fails at RUN, pretty early in the mediator. cesm.log:
Thanks for the update, @ekluzek. Which version of the nvhpc compiler are you using?
That signal 8 is an FPE (floating-point exception), maybe a divide-by-zero. If running in debug mode, there should be a traceback. Ordinarily I'd guess this is a bug in the code, not the compiler, but with NVHPC it's more iffy.
@sherimickelson that version is using ccs_config_cesm1.0.10, which uses nvhpc/24.3. There are two tests that PASS and have been passing as well: SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen @briandobbins that's all the traceback it gives, so not much to go on. One thing I'd like to try next is running a bunch more tests (maybe all of them?) to see what works and what doesn't.
In what will be ctsm5.3.015 there's a test that goes back to failing at the build step, because nvhpc fails at compiling one file:
It fails at that point and doesn't give any insight into why. I've added everything to the "ftn" interface that I can see. There must be nvhpc-specific arguments that can be given using the "-M" syntax.
Brief summary of bug
MPI tests with DEBUG on are failing at runtime with the nvhpc compiler on cheyenne.
This continues in ctsm5.1.dev155-38-g5c8f17b1a (derecho1 branch) on derecho
General bug information
CTSM version you are using: ctsm5.1.dev082 in cesm2_3_alpha08d
Does this bug cause significantly incorrect results in the model's science? No
Configurations affected: tests with nvhpc and DEBUG on
Details of bug
These tests fail:
SMS_D.f19_g17.IHistClm50Bgc.cheyenne_nvhpc.clm-decStart
SMS_D.f45_f45_mg37.I2000Clm50FatesRs.cheyenne_nvhpc.clm-FatesColdDef
SMS_D_Ld1.f10_f10_mg37.I1850Clm50Sp.cheyenne_nvhpc.clm-default
SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
While tests with DEBUG off PASS:
SMS.f19_g17.IHistClm50Bgc.cheyenne_nvhpc.clm-decStart
SMS_Ld1.f10_f10_mg37.I1850Clm50Sp.cheyenne_nvhpc.clm-default
As well as mpi-serial tests:
SMS_D_Ld1_Mmpi-serial.1x1_brazil.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
SMS_D_Ld1_Mmpi-serial.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
SMS_D_Mmpi-serial.1x1_brazil.I2000Clm50FatesRs.cheyenne_nvhpc.clm-FatesColdDef
SMS_D_Mmpi-serial.1x1_brazil.IHistClm50BgcQianRs.cheyenne_nvhpc.clm-default
SMS_Mmpi-serial.1x1_brazil.IHistClm50BgcQianRs.cheyenne_nvhpc.clm-default
Important details of your setup / configuration so we can reproduce the bug
Important output or errors that show the problem
For the smallest case: SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
The only log file available is the cesm.log file as follows.
cesm.log file: