S2SWA app crashes on xjet when initializing CICE #1262

Closed
DavidHuber-NOAA opened this issue Jun 8, 2022 · 7 comments · Fixed by #1563
Labels
bug Something isn't working

Comments

@DavidHuber-NOAA
Collaborator

DavidHuber-NOAA commented Jun 8, 2022

Description

The model crashes on xjet at line 1146 of ice_init.F90 with an "illegal instruction" error when initializing the CICE component. This is verified for the S2SWA app, but is likely also true for any S2S* app.

I believe I have tracked it down to the CMakeLists.txt in CICE-interface, which appends -xHOST to CMAKE_Fortran_FLAGS on line 10.
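One way to confirm where the flag comes from is to scan the build files for the host-specific Intel option. The following is a minimal Python sketch and is not part of the repository; the function name `find_host_flags` and the sample CMake text are hypothetical, and the real CMakeLists.txt contents may differ.

```python
import re

# Matches the Intel "-xHost" option in any capitalization (e.g. -xHOST).
HOST_FLAG = re.compile(r"(?<![\w-])-xhost\b", re.IGNORECASE)


def find_host_flags(cmake_text):
    """Return (line_number, line) pairs for lines that mention -xHost."""
    hits = []
    for number, line in enumerate(cmake_text.splitlines(), start=1):
        if HOST_FLAG.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical excerpt; the actual file in CICE-interface may look different.
sample = '''
set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -O2")
set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -xHOST")
'''

for number, line in find_host_flags(sample):
    print(f"line {number}: {line}")
```

To check a real checkout, the same function could be applied to the text of each CMakeLists.txt file read from disk.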

To Reproduce:

What compilers/machines are you seeing this with? Intel 18.0.5.274 and IMPI 2018.4.274
Give explicit steps to reproduce the behavior.

  1. Check out revision 5c2d1a92 (or likely newer)
  2. Compile the model on one of the head nodes
  3. Copy the executable into a run directory
  4. Submit a slurm job to xjet (or likely any partition except kjet)

Note, I have not tested this on any other partition.

Additional context

The regression tests also run on xjet, but their executables are compiled there as well, so they would not expose this problem.

Output

output logs
The complete log is attached with a snippet of the crash itself shown below.

36 + srun -l --export=ALL -n 2844 /lfs1/NESDIS/nesdis-rdo2/David.Huber/para/stmp/RUNDIRS/upd_s2sw/2013040100/gfs/fcst.31082/ufs_model
   0:
   0:
   0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
   0:      PROGRAM ufs       HAS BEGUN. COMPILED       0.00     ORG: np23
   0:      STARTING DATE-TIME  JUN 07,2022  18:41:15.733  158  TUE   2459738
   0:
   0:
   0: UFS Aerosols: Initializing ...
2604:  (input_data) Reading setup_nml
2604:  (input_data) Reading grid_nml
2604:  (input_data) Reading tracer_nml
2604:  (input_data) Reading thermo_nml
2604:  (input_data) Reading dynamics_nml
2604:  (input_data) Reading shortwave_nml
2604:  (input_data) Reading ponds_nml
2604:  (input_data) Reading snow_nml
2604:  (input_data) Reading forcing_nml
2604:  Diagnostic output will be in file
2604:  ice_diag.d
2604:
2679: forrtl: severe (168): Program Exception - illegal instruction
2679: Image              PC                Routine            Line        Source
2679: ufs_model          0000000007AC602E  Unknown               Unknown  Unknown
2679: libpthread-2.17.s  00002BA83838F630  Unknown               Unknown  Unknown
2679: ufs_model          00000000073491E2  ice_init_mp_input        1146  ice_init.F90
2679: ufs_model          000000000719B652  cice_initmod_mp_c          53  CICE_InitMod.F90
2679: ufs_model          0000000006F03E84  ice_comp_nuopc_mp         624  ice_comp_nuopc.F90
2679: ufs_model          0000000000A2EADE  _ZN5ESMCI6FTable1        2167  ESMCI_FTable.C
2679: ufs_model          0000000000A32A96  ESMCI_FTableCallE         824  ESMCI_FTable.C
2679: ufs_model          0000000000DA6EAF  _ZN5ESMCI3VMK5ent        2308  ESMCI_VMKernel.C
2679: ufs_model          00000000010AEEBA  _ZN5ESMCI2VM5ente        1216  ESMCI_VM.C
2679: ufs_model          0000000000A30177  c_esmc_ftablecall         981  ESMCI_FTable.C
2679: ufs_model          00000000009462C1  esmf_compmod_mp_e        1222  ESMF_Comp.F90
2679: ufs_model          00000000009DB3B4  esmf_gridcompmod_        1407  ESMF_GridComp.F90
2679: ufs_model          00000000008EE29F  nuopc_driver_mp_l        2565  NUOPC_Driver.F90
2679: ufs_model          0000000000912D9C  nuopc_driver_mp_i        1264  NUOPC_Driver.F90
2679: ufs_model          000000000091BC65  nuopc_driver_mp_i         455  NUOPC_Driver.F90
2679: ufs_model          0000000000A2EADE  _ZN5ESMCI6FTable1        2167  ESMCI_FTable.C
2679: ufs_model          0000000000A32A96  ESMCI_FTableCallE         824  ESMCI_FTable.C
2679: ufs_model          0000000000DA6EAF  _ZN5ESMCI3VMK5ent        2308  ESMCI_VMKernel.C
2679: ufs_model          00000000010AEEBA  _ZN5ESMCI2VM5ente        1216  ESMCI_VM.C
2679: ufs_model          0000000000A30177  c_esmc_ftablecall         981  ESMCI_FTable.C
2679: ufs_model          00000000009462C1  esmf_compmod_mp_e        1222  ESMF_Comp.F90
2679: ufs_model          00000000009DB3B4  esmf_gridcompmod_        1407  ESMF_GridComp.F90
2679: ufs_model          000000000041A976  MAIN__                    381  UFS.F90
2679: ufs_model          00000000004197DE  Unknown               Unknown  Unknown
2679: libc-2.17.so       00002BA8387D4555  __libc_start_main     Unknown  Unknown
2679: ufs_model          00000000004196E9  Unknown               Unknown  Unknown

[gfsfcst.log](https://github.com/ufs-community/ufs-weather-model/files/8863899/gfsfcst.log)

@DavidHuber-NOAA DavidHuber-NOAA added the bug Something isn't working label Jun 8, 2022
@junwang-noaa
Collaborator

@DavidHuber-NOAA We've been running low resolution cpld tests on xjet and haven't seen any issues. What resolution is your run? Can you run the cpld_control_p8 test on xjet to confirm the issue? Thanks

@DavidHuber-NOAA
Collaborator Author

@junwang-noaa I'm running at C384 resolution. Sure, I will run the regression test, though I expect it to pass since the executable will also be built on xjet.

@DavidHuber-NOAA
Collaborator Author

@junwang-noaa The regression test passed.

As an additional test, I'm going to try running the C384 forecast out 6 hours after recompiling with the -xHOST flag removed from CICE-interface/CMakeLists.txt.

@DavidHuber-NOAA
Collaborator Author

@junwang-noaa The C384 forecast successfully ran without the -xHOST flag. If AVX* instructions are required/desired for CICE, perhaps the options from cmake/configure_<machine>.intel.cmake could be used instead?

@DavidHuber-NOAA
Collaborator Author

I reran the regression test, compiling on kjet and then running cpld_control_p8 on xjet. This caused the same crash I see when compiling on the head node and then running on xjet. The cpld_control_p8 test directory is located here: /lfs1/NESDIS/nesdis-rdo2/David.Huber/RT_RUNDIRS/David.Huber/FV3_RT/rt_200684/cpld_control_p8.

DavidHuber-NOAA added a commit to DavidHuber-NOAA/ufs-weather-model that referenced this issue Jun 28, 2022
@junwang-noaa
Collaborator

@DavidHuber-NOAA I am not sure what the difference is between kjet and xjet. Have you run any tests on xjet before with the executable compiled on kjet?

@DavidHuber-NOAA
Collaborator Author

@junwang-noaa xjet's CPU architecture is Haswell, while kjet's is Skylake. The CPUs support different instruction sets (e.g. AVX2 on xjet and AVX-512 on kjet). Thus, when CICE is compiled on kjet with -xHOST, the compiler emits newer instructions than xjet's CPUs can execute.
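The mismatch can be checked on a Linux node by comparing the features a binary needs against the `flags` line of /proc/cpuinfo. This is a minimal Python sketch under stated assumptions: the helper names `parse_cpu_flags` and `missing_features` are hypothetical, and the two cpuinfo excerpts are abbreviated, made-up samples rather than output captured on Jet.

```python
def parse_cpu_flags(cpuinfo_text):
    """Collect feature names from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(required, cpuinfo_text):
    """Return the required features that the CPU does not report, sorted."""
    return sorted(set(required) - parse_cpu_flags(cpuinfo_text))


# Abbreviated, made-up cpuinfo excerpts for illustration only.
haswell_like = "model name : example\nflags : fpu sse4_2 avx avx2 fma"
skylake_like = "model name : example\nflags : fpu sse4_2 avx avx2 fma avx512f avx512cd"

# A binary built with AVX-512 code paths needs avx512f at run time.
print(missing_features(["avx2", "avx512f"], haswell_like))
print(missing_features(["avx2", "avx512f"], skylake_like))
```

On a compute node, passing the contents of /proc/cpuinfo as `cpuinfo_text` would show whether that node lacks a feature the executable was built to use.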

I have run tests on xjet before (not RTs, just forecasts), but in ATM-only and ATMW modes. These ran successfully.

If CICE requires extra instruction sets, then I would suggest using the flags in cmake/configure_<machine>.intel.cmake. Jet's cmake file sets flags so that the build is compatible with vjet, xjet, and kjet. If they are not required, then I'd suggest removing -xHOST from CICE's CMakeLists.txt file. None of the other interfaces specify extra compiler flags, so I suspect this is the intent, though I would defer to the CICE developers.
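The idea of a flag that works across partitions can be illustrated by picking the newest instruction-set level that every target partition supports. This Python sketch is only an illustration and does not reproduce what Jet's cmake file actually does; `ISA_LEVELS`, `PARTITION_ISA`, and `portable_flag` are hypothetical names, and vjet is left out because its architecture is not stated in this thread.

```python
# Instruction-set levels ordered from oldest to newest; the names mirror
# arguments accepted by the Intel compiler's -x option.
ISA_LEVELS = ["SSE4.2", "AVX", "CORE-AVX2", "CORE-AVX512"]

# Highest level each partition supports. xjet (Haswell) and kjet (Skylake)
# follow the thread; other partitions would need to be added by hand.
PARTITION_ISA = {"xjet": "CORE-AVX2", "kjet": "CORE-AVX512"}


def portable_flag(partitions):
    """Return the -x flag for the newest ISA level all partitions support."""
    lowest = min(ISA_LEVELS.index(PARTITION_ISA[p]) for p in partitions)
    return "-x" + ISA_LEVELS[lowest]


print(portable_flag(["xjet", "kjet"]))
print(portable_flag(["kjet"]))
```

The oldest partition in the list determines the result, which is why a build meant to run on both xjet and kjet cannot target AVX-512 as its baseline.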

DavidHuber-NOAA added a commit to DavidHuber-NOAA/ufs-weather-model that referenced this issue Jan 9, 2023
jkbk2004 added a commit that referenced this issue Jan 20, 2023
* Remove -xHOST from CICE CMakeLists.txt. #1262

Co-authored-by: David Huber <[email protected]>
Co-authored-by: JONG KIM <[email protected]>
Co-authored-by: Brian Curtis <[email protected]>