
nvhpc compiler tests are failing on cheyenne/derecho #1733

Open
ekluzek opened this issue Apr 30, 2022 · 12 comments
Labels
bug (something is working incorrectly), priority: low (background task that doesn't need to be done right away)

Comments

@ekluzek
Collaborator

ekluzek commented Apr 30, 2022

Brief summary of bug

MPI tests with DEBUG on are failing at runtime with the nvhpc compiler on cheyenne.
This continues as of ctsm5.1.dev155-38-g5c8f17b1a (the derecho1 branch) on derecho.

General bug information

CTSM version you are using: ctsm5.1.dev082 in cesm2_3_alpha08d

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: tests with nvhpc and DEBUG on

Details of bug

These tests fail:

SMS_D.f19_g17.IHistClm50Bgc.cheyenne_nvhpc.clm-decStart
SMS_D.f45_f45_mg37.I2000Clm50FatesRs.cheyenne_nvhpc.clm-FatesColdDef
SMS_D_Ld1.f10_f10_mg37.I1850Clm50Sp.cheyenne_nvhpc.clm-default
SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default

While these DEBUG-off tests PASS:

SMS.f19_g17.IHistClm50Bgc.cheyenne_nvhpc.clm-decStart
SMS_Ld1.f10_f10_mg37.I1850Clm50Sp.cheyenne_nvhpc.clm-default

As do these mpi-serial tests:

SMS_D_Ld1_Mmpi-serial.1x1_brazil.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
SMS_D_Ld1_Mmpi-serial.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
SMS_D_Mmpi-serial.1x1_brazil.I2000Clm50FatesRs.cheyenne_nvhpc.clm-FatesColdDef
SMS_D_Mmpi-serial.1x1_brazil.IHistClm50BgcQianRs.cheyenne_nvhpc.clm-default
SMS_Mmpi-serial.1x1_brazil.IHistClm50BgcQianRs.cheyenne_nvhpc.clm-default

Important details of your setup / configuration so we can reproduce the bug

Important output or errors that show the problem

For the smallest case: SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default

The only log file available is the cesm.log file as follows.

cesm.log file:

 (t_initf)       profile_single_file=       F
 (t_initf)       profile_global_stats=      T
 (t_initf)       profile_ovhd_measurement=  F
 (t_initf)       profile_add_detail=        F
 (t_initf)       profile_papi_enable=       F
[r12i4n4:35002:0:35002] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35003:0:35003] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35004:0:35004] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35006:0:35006] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35007:0:35007] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35008:0:35008] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35010:0:35010] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35011:0:35011] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35012:0:35012] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35013:0:35013] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35014:0:35014] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35015:0:35015] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35017:0:35017] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35018:0:35018] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35019:0:35019] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35020:0:35020] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35022:0:35022] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35000:0:35000] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35001:0:35001] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35016:0:35016] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 21 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
==== backtrace (tid:  35022) ====
 0  /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(ucs_handle_error+0xe4) [0x2ba9d97301a4]
 1  /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(+0x2a4cc) [0x2ba9d97304cc]
 2  /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(+0x2a73b) [0x2ba9d973073b]
 3  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6LogErr13MsgFoundErrorEiPKciS2_S2_Pi+0x34) [0x2ba9b78f4c74]
 4  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI7MeshCap22meshcreatenodedistgridEPi+0x7f) [0x2ba9b7b15ebf]
 5  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_meshcreatenodedistgrid_+0xc1) [0x2ba9b7b61141]
 6  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshaddelements_+0xbc0) [0x2ba9b881c880]
 7  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshcreatefromunstruct_+0x4d0f) [0x2ba9b88246cf]
 8  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshcreatefromfile_+0x270) [0x2ba9b881f270]
 9  /glade/scratch/erik/SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default.GC.cesm2_3_alpha8achlist/bld/cesm.exe() [0x15d8fd0]
10  /glade/scratch/erik/SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default.GC.cesm2_3_alpha8achlist/bld/cesm.exe() [0x632341]
11  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi+0xc30) [0x2ba9b77436b0]
12  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ESMCI_FTableCallEntryPointVMHop+0x293) [0x2ba9b773e913]
13  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI3VMK5enterEPNS_7VMKPlanEPvS3_+0xbb) [0x2ba9b7f7b9fb]
14  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI2VM5enterEPNS_6VMPlanEPvS3_+0xbe) [0x2ba9b7fa3bbe]
15  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_ftablecallentrypointvm_+0x393) [0x2ba9b773edd3]
16  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_compmod_esmf_compexecute_+0xa26) [0x2ba9b82d2c66]
17  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_gridcompmod_esmf_gridcompinitialize_+0x1de) [0x2ba9b85a5ede]
@glemieux
Collaborator

glemieux commented Aug 9, 2023

Updating to ccs_config_cesm0.0.65 via #2000 now results in all the nvhpc tests on cheyenne failing at run time. It is expected that updating to cesm2_3_beta15 will resolve this.

@glemieux changed the title from "MPI tests with DEBUG on are failing with nvhpc compiler on cheyenne" to "nvhpc compiler tests are failing on cheyenne" Aug 9, 2023
@ekluzek
Collaborator Author

ekluzek commented Nov 7, 2023

In the CESM3_dev branch two of the tests now PASS:

SMS.f10_f10_mg37.I2000Clm50BgcCrop.cheyenne_nvhpc.clm-crop FAILED PREVIOUSLY
SMS.f45_f45_mg37.I2000Clm51FatesSpRsGs.cheyenne_nvhpc.clm-FatesColdSatPhen FAILED PREVIOUSLY

While this one still fails, now with a floating point exception:

SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop EXPECTED

The cesm.log file shows that there is a problem in ESMF at initialization when creating an ESMF mesh. It doesn't drop PET files by default in this case...
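
If per-PET ESMF log files would help here, they can usually be turned on for the case before rerunning. This is only a sketch of the standard CIME xmlchange workflow; the variable and value names (ESMF_LOGFILE_KIND, ESMF_LOGKIND_MULTI) are assumptions that should be verified with xmlquery first:

cd $CASEROOT
./xmlquery --listall | grep -i esmf                 # check which ESMF logging variable this case actually has (assumption)
./xmlchange ESMF_LOGFILE_KIND=ESMF_LOGKIND_MULTI    # assumed variable/value names
./case.submit                                       # rerun; PET*.ESMF_LogFile files should then appear in the run directory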

cesm.log:

[1,0]<stderr>: (t_initf)       profile_papi_enable=       F
[1,0]<stdout>: /glade/work/erik/ctsm_worktrees/cesm3_dev/share/src/shr_file_mod.F90
[1,0]<stdout>:          912 This routine is depricated - use shr_log_setLogUnit instead
[1,0]<stdout>:          -18
[1,0]<stdout>: /glade/work/erik/ctsm_worktrees/cesm3_dev/share/src/shr_file_mod.F90
[1,0]<stdout>:          912 This routine is depricated - use shr_log_setLogUnit instead
[1,0]<stdout>:          -25
[1,0]<stderr>:[r3i7n18:45933:0:45933] Caught signal 8 (Floating point exception: floating-point invalid operation)
[1,36]<stderr>:[r3i7n33:33507:0:33507] Caught signal 8 (Floating point exception: floating-point invalid operation)
[1,0]<stderr>:==== backtrace (tid:  45933) ====
[1,0]<stderr>: 0  /glade/u/apps/ch/opt/ucx/1.12.1/lib/libucs.so.0(ucs_handle_error+0x134) [0x2ae710b0fd74]
[1,0]<stderr>: 1  /glade/u/apps/ch/opt/ucx/1.12.1/lib/libucs.so.0(+0x2e0dc) [0x2ae710b100dc]
[1,0]<stderr>: 2  /glade/u/apps/ch/opt/ucx/1.12.1/lib/libucs.so.0(+0x2e463) [0x2ae710b10463]
[1,0]<stderr>: 3  /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/libmca_common_ompio.so.41(mca_common_ompio_simple_grouping+0xe4) [0x2ae71fa93a64]
[1,0]<stderr>: 4  /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/libmca_common_ompio.so.41(mca_common_ompio_set_view+0x937) [0x2ae71fa9c877]
[1,0]<stderr>: 5  /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/openmpi/mca_io_ompio.so(mca_io_ompio_file_set_view+0xc7) [0x2ae720cf2347]
[1,0]<stderr>: 6  /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/libmpi.so.40(PMPI_File_set_view+0x1a4) [0x2ae6f30a68e4]
[1,0]<stderr>: 7  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpio_file_set_view+0x161) [0x2ae6f034d4a1]
[1,0]<stderr>: 8  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32e28e2) [0x2ae6f032b8e2]
[1,0]<stderr>: 9  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32e1469) [0x2ae6f032a469]
[1,0]<stderr>:10  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32e02c6) [0x2ae6f03292c6]
[1,0]<stderr>:11  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32df9d2) [0x2ae6f03289d2]
[1,0]<stderr>:12  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpio_wait+0x9f) [0x2ae6f032855f]
[1,0]<stderr>:13  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpio_get_varn+0x9f) [0x2ae6f032781f]
[1,0]<stderr>:14  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpi_get_varn_all+0x2d7) [0x2ae6f02be097]
[1,0]<stderr>:15  /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe() [0x1a758be]
[1,0]<stderr>:16  /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe(PIOc_read_darray+0x413) [0x1a72c53]
[1,0]<stderr>:17  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z37get_numElementConn_from_ESMFMesh_fileiiPcxiPxRPi+0x48e) [0x2ae6ee1c7d8e]
[1,0]<stderr>:18  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z42get_elemConn_info_2Dvar_from_ESMFMesh_fileiiPcxiPiRiRS0_S2_+0x99) [0x2ae6ee1c9c19]
[1,0]<stderr>:19  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z36get_elemConn_info_from_ESMFMesh_fileiiPcxiPiRiRS0_S2_+0x28c) [0x2ae6ee1caa4c]
[1,0]<stderr>:20  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z36ESMCI_mesh_create_from_ESMFMesh_fileiPcb18ESMC_CoordSys_FlagPN5ESMCI8DistGridEPPNS1_4MeshE+0x63a) [0x2ae6ee6bc87a]
[1,0]<stderr>:21  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z27ESMCI_mesh_create_from_filePc20ESMC_FileFormat_Flagbb18ESMC_CoordSys_Flag17ESMC_MeshLoc_FlagS_PN5ESMCI8DistGridES5_PPNS3_4MeshEPi+0x2eb) [0x2ae6ee6bb8eb]
[1,0]<stderr>:22  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI7MeshCap21meshcreatefromfilenewEPc20ESMC_FileFormat_Flagbb18ESMC_CoordSys_Flag17ESMC_MeshLoc_FlagS1_PNS_8DistGridES6_Pi+0x99) [0x2ae6ee675919]
[1,0]<stderr>:23  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_meshcreatefromfile_+0x1a7) [0x2ae6ee6c51a7]
[1,0]<stderr>:24  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshcreat[1,0]<stderr>:efromfile_+0x217) [0x2ae6ef401fd7]
[1,0]<stderr>:25  /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe() [0x17668d1]
[1,0]<stderr>:26  /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe() [0x61af01]
[1,0]<stderr>:27  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi+0xc3c) [0x2ae6ee25633c]
[1,0]<stderr>:28  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ESMCI_FTableCallEntryPointVMHop+0x293) [0x2ae6ee251953]
[1,0]<stderr>:29  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI3VMK5enterEPNS_7VMKPlanEPvS3_+0xbb) [0x2ae6eeaf82fb]
[1,0]<stderr>:30  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI2VM5enterEPNS_6VMPlanEPvS3_+0xbe) [0x2ae6eeb2237e]
[1,0]<stderr>:31  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_ftablecallentrypointvm_+0x393) [0x2ae6ee251e13]
[1,0]<stderr>:32  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_compmod_esmf_compexecute_+0xab0) [0x2ae6eee59870]
[1,0]<stderr>:33  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_gridcompmod_esmf_gridcompinitialize_+0x1de) [0x2ae6ef13f35e]
[1,0]<stderr>:34  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(nuopc_driver_loopmodelcompss_+0x1036) [0x2ae6ef8ad876]
[1,0]<stderr>:35  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(nuopc_driver_initializeipdv02p3_+0x2208) [0x2ae6ef89fcc8]
[1,0]<stderr>:36  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi+0xc3c) [0x2ae6ee25633c]
[1,0]<stderr>:37  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ESMCI_FTableCallEntryPointVMHop+0x293) [0x2ae6ee251953]
[1,0]<stderr>:38  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI3VMK5enterEPNS_7VMKPlanEPvS3_+0xbb) [0x2ae6eeaf82fb]
[1,0]<stderr>:39  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI2VM5enterEPNS_6VMPlanEPvS3_+0xbe) [0x2ae6eeb2237e]
[1,0]<stderr>:40  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_ftablecallentrypointvm_+0x393) [0x2ae6ee251e13]
[1,0]<stderr>:41  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_compmod_esmf_compexecute_+0xab0) [0x2ae6eee59870]
[1,0]<stderr>:42  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_gridcompmod_esmf_gridcompinitialize_+0x1de) [0x2ae6ef13f35e]

@ekluzek
Collaborator Author

ekluzek commented Dec 1, 2023

Seeing similar errors on Derecho:

These PASS:
SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop
SMS.f45_f45_mg37.I2000Clm51FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen

These FAIL:
ERP_D_P128x2_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default
ERS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default
SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.derecho_nvhpc.clm-crop
SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50SpRs.derecho_nvhpc.clm-ptsRLA
SMS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default

The failures are now all at the build step, with an error message from the FATES code like this:

Lowering Error: symbol hlm_pft_map$sd is an inconsistent array descriptor
NVFORTRAN-F-0000-Internal compiler error. Errors in Lowering       1  (/glade/work/erik/ctsm_worktrees/external_updates/src/fates/main/EDPftvarcon.F90: 2191)
NVFORTRAN/x86-64 Linux 23.5-0: compilation aborted
gmake: *** [/glade/derecho/scratch/erik/tests_ctsm51d155derechoacl/SMS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default.GC.ctsm51d155derechoacl_nvh/Tools/Makefile:978: EDPftvarcon.o] Error 2
gmake: *** Waiting for unfinished jobs....

Looking at the code I don't see an obvious problem. Searching online turns up some NVIDIA nvhpc reports about these kinds of errors, but it's not obvious what the issue is here or how to fix it.
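
One way to narrow it down, sketched below: recompile just the failing file by hand with the same flag set the case build uses (the flags here are copied from the debug compile line later in this issue), then drop the debug-checking flags one at a time to see which one triggers the internal compiler error. Nothing in this sketch is known to avoid the ICE, and the include paths are placeholders for whatever the case Makefile passes.

# Rebuild only EDPftvarcon.F90 with the case's debug flags (include dirs abbreviated):
ftn -c -Mnofma -i4 -gopt -Mextend -byteswapio -Mflushz -Kieee -O0 -g -Ktrap=fp -Mbounds \
    -I<include dirs from the failing gmake line> \
    /glade/work/erik/ctsm_worktrees/external_updates/src/fates/main/EDPftvarcon.F90
# Repeat, removing -Mbounds, then -Ktrap=fp, then -g/-O0, one flag at a time.
# If a single flag makes the "inconsistent array descriptor" error go away, that flag plus a
# cut-down copy of EDPftvarcon.F90 is usually enough for a bug report to NVIDIA.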

@ekluzek changed the title from "nvhpc compiler tests are failing on cheyenne" to "nvhpc compiler tests are failing on cheyenne/derecho" Dec 1, 2023
@ekluzek
Collaborator Author

ekluzek commented Apr 16, 2024

A reminder that nvhpc is important for the flexibility to start using GPUs, and since Derecho has NVIDIA GPUs, nvhpc is likely to be the most performant compiler for Derecho's GPUs.

Even though GPUs don't currently look important for most uses of CTSM, they will be important for ultra-high resolution. And as hardware changes in the future, it's important to have the flexibility in the model to take advantage of different types of hardware, in order to keep the model working well.

@ekluzek added the "next" label (this should get some attention in the next week or two; normally each Thursday SE meeting) Apr 16, 2024
@ekluzek
Collaborator Author

ekluzek commented Apr 16, 2024

Corrected above that Derecho has NVIDIA GPUs. From talking with @sherimickelson, and from slides presented by her group at the Sep 12, 2023 CSEG meeting, the nvhpc and cray compilers work for the Derecho GPUs, but intel-oneapi did not at the time.

@ekluzek added the "priority: low" label (background task that doesn't need to be done right away) and removed the "next" label Apr 24, 2024
@ekluzek
Collaborator Author

ekluzek commented Apr 24, 2024

We talked about this in the CSEG meeting. The takeaways are:

Jim: feels that we do want to test with NVHPC, so that we know if things start failing. If we need to write a bug report, we can do that, and then move on.
Brian: agrees that testing with it is good, but supporting nvhpc shouldn't be a requirement for CESM3.

@sherimickelson

This is great news. Thanks, @ekluzek, for sharing this and for your support.

@ekluzek
Collaborator Author

ekluzek commented Dec 3, 2024

In what will be ctsm5.3.014 the test SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop now passes the BUILD phase and fails at RUN, pretty early in the mediator.

cesm.log:

dec2441.hsn.de.hpc.ucar.edu 0:  (t_initf) Read in prof_inparm namelist from: drv_in
dec2441.hsn.de.hpc.ucar.edu 0:  (t_initf) Using profile_disable=           F
dec2441.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_timer=                       4
dec2441.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_depth_limit=                12
dec2441.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_detail_limit=                2
dec2441.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_barrier=           F
dec2441.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_outpe_num=                   1
dec2441.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_outpe_stride=                0
dec2441.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_single_file=       F
dec2441.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_global_stats=      T
dec2441.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_ovhd_measurement=  F
dec2441.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_add_detail=        F
dec2441.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_papi_enable=       F
dec2441.hsn.de.hpc.ucar.edu: rank 0 died from signal 8
dec2441.hsn.de.hpc.ucar.edu: rank 109 died from signal 15

@sherimickelson

Thanks for the update, @ekluzek. Which version of the nvhpc compiler are you using?

@briandobbins
Contributor

That signal 8 is a floating-point exception (FPE) -- maybe a divide-by-zero. If running in debug mode, there should be a traceback.

Ordinarily, I'd guess this is a bug in the code, not the compiler, but with NVHPC it's more iffy.

@ekluzek
Collaborator Author

ekluzek commented Dec 3, 2024

@sherimickelson that version is using ccs_config_cesm1.0.10, which uses nvhpc/24.3.

There are two tests that PASS and have been passing as well:

SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen
SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop

@briandobbins that's all the traceback it gives. So not much to go off of.

One next thing I'd like to try is to run a bunch more tests (maybe all of them?) to see what works and what doesn't....
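
Another thing that might help pin the signal 8 down (a sketch only; -traceback is a standard nvfortran option, and the core-file step assumes the job environment allows core dumps):

# Add -traceback alongside the existing -g -O0 -Ktrap=fp debug flags so the runtime prints a
# call stack at the floating-point exception instead of only "rank 0 died from signal 8".
# Or let the failing rank dump core and inspect it afterwards:
ulimit -c unlimited                        # in the job script, before cesm.exe launches
gdb <path to bld>/cesm.exe <core file>     # then 'bt' shows where signal 8 was raised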

@ekluzek
Collaborator Author

ekluzek commented Dec 12, 2024

In what will be ctsm5.3.015 there's a test that goes back to failing at the build step, because nvhpc fails to compile one file:

ftn -c -I. -I/glade/derecho/scratch/erik/tests_ctsm5314rpointeracl/sharedlibroot.ctsm5314rpointeracl_nvh/nvhpc/mpich/debug/nothreads/CDEPS/fox/include -I/glade/derecho/scratch/erik/tests_ctsm5314rpointeracl/sharedlibroot.ctsm5314rpointeracl_nvh/nvhpc/mpich/debug/nothreads/CDEPS/dshr -I/glade/derecho/scratch/erik/tests_ctsm5314rpointeracl/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.ctsm5314rpointeracl_nvh/bld/nvhpc/mpich/debug/nothreads/include -I/glade/derecho/scratch/erik/tests_ctsm5314rpointeracl/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.ctsm5314rpointeracl_nvh/bld/nvhpc/mpich/debug/nothreads/include -I/glade/derecho/scratch/erik/tests_ctsm5314rpointeracl/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.ctsm5314rpointeracl_nvh/bld/nvhpc/mpich/debug/nothreads/finclude -I/glade/u/apps/derecho/23.09/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.27/nvhpc/24.3/zlnp/include -I/glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/nvhpc/24.3/jqjr/include -I/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/nvhpc-24.3/parallelio-2.6.2-nlkasrv3nxk2px4fcqs4a2qgrqsqzvhc/include -I/glade/derecho/scratch/erik/tests_ctsm5314rpointeracl/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.ctsm5314rpointeracl_nvh/bld/nvhpc/mpich/debug/nothreads/include -I/glade/derecho/scratch/erik/tests_ctsm5314rpointeracl/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.ctsm5314rpointeracl_nvh/bld/nvhpc/mpich/debug/nothreads/include -I/glade/derecho/scratch/erik/tests_ctsm5314rpointeracl/sharedlibroot.ctsm5314rpointeracl_nvh/nvhpc/mpich/debug/nothreads/clm/obj -I. -I/glade/derecho/scratch/erik/tests_ctsm5314rpointeracl/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.ctsm5314rpointeracl_nvh/SourceMods/src.clm -I/glade/work/erik/ctsm_worktrees/quickfix/src/cpl/nuopc -I/glade/work/erik/ctsm_worktrees/quickfix/src/main -I/glade/work/erik/ctsm_worktrees/quickfix/src/biogeophys -I/glade/work/erik/ctsm_worktrees/quickfix/src/biogeochem -I/glade/work/erik/ctsm_worktrees/quickfix/src/soilbiogeochem -I/glade/work/erik/ctsm_worktrees/quickfix/src/dyn_subgrid -I/glade/work/erik/ctsm_worktrees/quickfix/src/init_interp -I/glade/work/erik/ctsm_worktrees/quickfix/src/self_tests -I/glade/work/erik/ctsm_worktrees/quickfix/src/fates -I/glade/work/erik/ctsm_worktrees/quickfix/src/fates/main -I/glade/work/erik/ctsm_worktrees/quickfix/src/fates/biogeophys -I/glade/work/erik/ctsm_worktrees/quickfix/src/fates/biogeochem -I/glade/work/erik/ctsm_worktrees/quickfix/src/fates/fire -I/glade/work/erik/ctsm_worktrees/quickfix/src/fates/parteh -I/glade/work/erik/ctsm_worktrees/quickfix/src/fates/radiation -I/glade/work/erik/ctsm_worktrees/quickfix/src/utils -I/glade/work/erik/ctsm_worktrees/quickfix/src/cpl -I/glade/work/erik/ctsm_worktrees/quickfix/src/cpl/utils -I/glade/work/erik/ctsm_worktrees/quickfix/src/cpl/share_esmf -I/glade/derecho/scratch/erik/tests_ctsm5314rpointeracl/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.ctsm5314rpointeracl_nvh/bld/lib/include -Mnofma -i4 -gopt -time -Mextend -byteswapio -Mflushz -Kieee -O0 -g -Ktrap=fp -Mbounds -Kieee -I/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/nvhpc-24.3/esmf-8.6.1-eo57gfoklavco73jpctgcnqdt36ads7c/include -I/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/nvhpc-24.3/esmf-8.6.1-eo57gfoklavco73jpctgcnqdt36ads7c/include 
-I/glade/u/apps/derecho/23.09/spack/opt/spack/netcdf-c/4.9.2/cray-mpich/8.1.27/nvhpc/24.3/3c7o/include -I/glade/u/apps/derecho/23.09/spack/opt/spack/netcdf-fortran/4.6.1/cray-mpich/8.1.27/nvhpc/24.3/i6rj/include -I/glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/cray-mpich/8.1.27/nvhpc/24.3/dcds/include -DCNL -DCESMCOUPLED -DFORTRANUNDERSCORE -DNO_SHR_VMATH -DNO_R16 -DCPRPGI -DLINUX -DHAVE_GETTID -DDEBUG -DUSE_ESMF_LIB -DHAVE_MPI -DNUOPC_INTERFACE -DPIO2 -DHAVE_SLASHPROC -D_PNETCDF -DESMF_VERSION_MAJOR=8 -DESMF_VERSION_MINOR=6 -DATM_PRESENT -DICE_PRESENT -DLND_PRESENT -DOCN_PRESENT -DROF_PRESENT -DGLC_PRESENT -DWAV_PRESENT -DESP_PRESENT -DMED_PRESENT -DPIO2 -Mfree -craype-verbose -DUSE_CONTIGUOUS= /glade/work/erik/ctsm_worktrees/quickfix/src/dyn_subgrid/dynColumnStateUpdaterMod.F90

It fails at that point and doesn't give any insight into why. I've added everything to the "ftn" interface that I can see. There must be nvhpc-specific arguments that can be given using the "-M" syntax.
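
One possible next step, sketched with standard nvhpc options (nothing here is specific to this failure, and the long include/define list is just abbreviated from the command above):

# Re-run the single failing compile by hand with more compiler chatter:
ftn -v -Minfo=all -c <same -I/-D/-M flags as in the command above> \
    /glade/work/erik/ctsm_worktrees/quickfix/src/dyn_subgrid/dynColumnStateUpdaterMod.F90
# -v shows each compiler phase as it runs, which makes it clearer whether the front end or a
# later lowering/optimization phase is the one that dies; -Minfo=all prints per-routine messages.
# As with the EDPftvarcon.F90 failure above, dropping -Mbounds, -Ktrap=fp, or -g/-O0 one at a
# time often identifies which debug flag triggers the problem.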
