GSI segfaults with spack-stack 1.4.0 on hercules #643

Closed
jswhit opened this issue Jun 23, 2023 · 12 comments
@jswhit

jswhit commented Jun 23, 2023

I'm trying to get the NOAA GSI data assimilation system running on hercules. I used the following environment to compile the latest develop branch on orion with spack-stack 1.4.0:

module use /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0/envs/unified-env-v2/install/modulefiles/Core
module use /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0/envs/unified-env-v2/install/modulefiles/intel-oneapi-mpi/2021.5.1/intel/2022.0.2
module load stack-intel
module load stack-intel-oneapi-mpi
module load mkl
module load miniconda
module load cmake
module load bufr
module load bacio
module load w3emc
module load sp
module load ip
module load sigio
module load sfcio
module load nemsio
module load wrf-io
module load ncio
module load crtm
module load gsi-ncdiag

Currently Loaded Modules:
  1) intel/2022.1.2                    8) curl/8.0.1         15) w3emc/2.9.2   22) netcdf-fortran/4.6.0  29) gsi-ncdiag/1.0.0
  2) stack-intel/2022.0.2              9) pkg-config/0.27.1  16) sp/2.3.3      23) wrf-io/1.2.0          
  3) impi/2022.1.2                    10) hdf5/1.14.0        17) ip/3.3.3      24) ncio/1.1.2           
  4) stack-intel-oneapi-mpi/2021.5.1  11) zstd/1.5.2         18) sigio/2.3.2   25) cmake/3.23.1
  5) mkl/2020.2                       12) netcdf-c/4.9.2     19) sfcio/1.4.1   26) crtm-fix/2.4.0_emc
  6) miniconda/4.12.0                 13) bufr/11.7.1        20) w3nco/2.4.1   27) git-lfs/2.12.0
  7) zlib/1.2.13                      14) bacio/2.4.1        21) nemsio/2.5.2  28) crtm/2.4.0

and this environment on hercules:

module use /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/envs/unified-env-v2/install/modulefiles/Core
module use /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/envs/unified-env-v2/install/modulefiles/intel-oneapi-mpi/2021.7.1/intel/2021.7.1
module load stack-intel
module load stack-intel-oneapi-mpi
module load intel-oneapi-mkl
module load stack-python
module load cmake
module load bufr
module load bacio
module load w3emc
module load sp
module load ip
module load sigio
module load sfcio
module load nemsio
module load wrf-io
module load ncio
module load crtm
module load gsi-ncdiag

Currently Loaded Modules:
  1) intel-oneapi-compilers/2022.2.1   7) cmake/3.23.1    13) bufr/11.7.1  19) sfcio/1.4.1           25) crtm-fix/2.4.0_emc
  2) stack-intel/2021.7.1              8) zlib/1.2.13     14) bacio/2.4.1  20) w3nco/2.4.1           26) git-lfs/3.1.2
  3) intel-oneapi-mpi/2021.7.1         9) curl/8.0.1      15) w3emc/2.9.2  21) nemsio/2.5.2          27) crtm/2.4.0
  4) stack-intel-oneapi-mpi/2021.7.1  10) hdf5/1.14.0     16) sp/2.3.3     22) netcdf-fortran/4.6.0  28) gsi-ncdiag/1.0.0
  5) intel-oneapi-mkl/2022.2.1        11) zstd/1.5.2      17) ip/3.3.3     23) wrf-io/1.2.0          
  6) stack-python/3.9.14              12) netcdf-c/4.9.2  18) sigio/2.3.2  24) ncio/1.1.2            

GSI runs to completion on orion, but segfaults on hercules with the traceback below. Note that the segfault occurs in the CRTM library, which is the same version on both orion and hercules.

[hercules-06-20:129948:0:129948] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
[hercules-06-38:425233:0:425233] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
[hercules-06-23:122151:0:122151] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7ffcd18e2010)
[hercules-06-23:122149:0:122149] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
[hercules-06-34:135253:0:135253] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
[hercules-06-28:123765:0:123765] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
[hercules-06-37:204900:0:204900] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
[hercules-06-39:222132:0:222132] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
[hercules-06-33:122560:0:122560] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
[hercules-06-24:125957:0:125957] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
[hercules-06-22:123409:0:123409] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
[hercules-06-31:145177:0:145177] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7ffc57ef4008)
[hercules-06-32:120227:0:120227] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
[hercules-06-36:212312:0:212312] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
==== backtrace (tid: 122567) ====
 0 0x0000000000054d90 __GI___sigaction()  :0
 1 0x000000000213ec80 process_allocation_records_deallocate()  for_alloc_copy.c:0
 2 0x000000000213e5f4 do_deallocate_all()  for_alloc_copy.c:0
 3 0x0000000001db3c92 crtm_k_matrix_module_mp_crtm_k_matrix_()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-crtm-2.4.0-nauulyni574akzexbjfag3p4e4snaxxi/spack-src/libsrc/CRTM_K_Matrix_Module.f90:481
 4 0x000000000187082d crtm_interface_mp_call_crtm_()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/crtm_interface.f90:2157
 5 0x000000000154e082 rad_setup_mp_setuprad_()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/setuprad.f90:902
 6 0x000000000116124a gsi_radoper_mp_setup__()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/gsi_radOper.F90:100
 7 0x0000000000b238f0 setuprhsall_()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/setuprhsall.f90:490
 8 0x000000000113fb6d glbsoi_()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/glbsoi.f90:324
 9 0x0000000000645a27 gsisub_()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/gsisub.F90:200
10 0x000000000041633d gsimod_mp_gsimain_run_()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/gsimod.F90:2313
11 0x000000000041627f MAIN__()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/gsimain.f90:631
12 0x000000000041621d main()  ???:0
13 0x000000000003feb0 __libc_start_call_main()  ???:0
14 0x000000000003ff60 __libc_start_main_alias_2()  :0
15 0x0000000000416135 _start()  ???:0
=================================
==== backtrace (tid: 123380) ====
 0 0x0000000000054d90 __GI___sigaction()  :0
 1 0x000000000213ec80 process_allocation_records_deallocate()  for_alloc_copy.c:0
 2 0x000000000213e5f4 do_deallocate_all()  for_alloc_copy.c:0
 3 0x0000000001db3c92 crtm_k_matrix_module_mp_crtm_k_matrix_()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-crtm-2.4.0-nauulyni574akzexbjfag3p4e4snaxxi/spack-src/libsrc/CRTM_K_Matrix_Module.f90:481
 4 0x000000000187082d crtm_interface_mp_call_crtm_()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/crtm_interface.f90:2157
 5 0x000000000154e082 rad_setup_mp_setuprad_()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/setuprad.f90:902
 6 0x000000000116124a gsi_radoper_mp_setup__()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/gsi_radOper.F90:100
 7 0x0000000000b238f0 setuprhsall_()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/setuprhsall.f90:490
[hercules-06-37:204891:0:204891] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
 8 0x000000000113fb6d glbsoi_()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/glbsoi.f90:324
 9 0x0000000000645a27 gsisub_()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/gsisub.F90:200
10 0x000000000041633d gsimod_mp_gsimain_run_()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/gsimod.F90:2313

11 0x000000000041627f MAIN__()  /work/noaa/gsienkf/whitaker/GSI-jswhit/src/gsi/gsimain.f90:631
12 0x000000000041621d main()  ???:0
13 0x000000000003feb0 __libc_start_call_main()  ???:0
14 0x000000000003ff60 __libc_start_main_alias_2()  :0
15 0x0000000000416135 _start()  ???:0
=================================
[hercules-06-37:204892:0:204892] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libc.so.6          0000147E36A32D90  Unknown               Unknown  Unknown
global_gsi         000000000213EC80  Unknown               Unknown  Unknown
global_gsi         000000000213E5F4  Unknown               Unknown  Unknown
global_gsi         0000000001DB3C92  crtm_k_matrix_mod         481  CRTM_K_Matrix_Module.f90
global_gsi         000000000187082D  crtm_interface_mp        2157  crtm_interface.f90
global_gsi         000000000154E082  rad_setup_mp_setu         902  setuprad.f90
global_gsi         000000000116124A  gsi_radoper_mp_se         100  gsi_radOper.F90
global_gsi         0000000000B238F0  setuprhsall_              490  setuprhsall.f90
global_gsi         000000000113FB6D  glbsoi_                   324  glbsoi.f90
global_gsi         0000000000645A27  gsisub_                   200  gsisub.F90
global_gsi         000000000041633D  gsimod_mp_gsimain        2313  gsimod.F90
global_gsi         000000000041627F  MAIN__                    631  gsimain.f90
global_gsi         000000000041621D  Unknown               Unknown  Unknown
libc.so.6          0000147E36A1DEB0  Unknown               Unknown  Unknown
libc.so.6          0000147E36A1DF60  __libc_start_main     Unknown  Unknown
global_gsi         0000000000416135  Unknown               Unknown  Unknown

(Full stdout and stderr are in /work2/noaa/gsienkf/jwhitake/C192_hybcov_hourly_esmda2/2021082922//logs/run_gsianal_1.out.)
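
With output from many MPI ranks interleaved, it can help to summarize the `Caught signal 11` lines before reading individual backtraces. Here is a minimal sketch, assuming the log lines have the format shown above. `summarize_segfaults` is a hypothetical helper name, and the sample text is a short excerpt from the traceback in this issue:

```python
import re
from collections import Counter

# Matches lines such as:
# [hercules-06-20:129948:0:129948] Caught signal 11 (Segmentation fault: ... at address 0x1)
SEGV_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):\d+:\d+\] Caught signal 11 "
    r"\(Segmentation fault: .* at address (?P<addr>0x[0-9a-fA-F]+)\)"
)


def summarize_segfaults(log_text):
    """Return (set of hosts, Counter of fault addresses) for all signal-11 lines."""
    hosts = set()
    addresses = Counter()
    for match in SEGV_RE.finditer(log_text):
        hosts.add(match.group("host"))
        addresses[match.group("addr")] += 1
    return hosts, addresses


# Short excerpt from the log above, used only to exercise the helper.
sample = """\
[hercules-06-20:129948:0:129948] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
[hercules-06-23:122151:0:122151] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7ffcd18e2010)
[hercules-06-23:122149:0:122149] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
"""

hosts, addresses = summarize_segfaults(sample)
print(f"{sum(addresses.values())} faulting processes on {len(hosts)} nodes")
for addr, count in addresses.most_common():
    print(f"  {addr}: {count}")
```

Run against the full log file, this shows at a glance whether every process faults at the same address and whether the failures are confined to particular nodes.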

Not sure if this is a spack-stack issue, a GSI issue, or a hercules issue - but I thought I'd try here first to see if anyone has any suggestions.

@jswhit jswhit added the bug label (Something is not working) Jun 23, 2023
@climbfuji
Collaborator

That's good to know. I have seen errors (segfaults) related to CRTM on Hercules in the JEDI CI tests as well. I suspect it has to do with the newest Intel compilers being used (most likely hinting at a bug in the CRTM code).

@climbfuji
Collaborator

@BenjaminTJohnson FYI

@climbfuji climbfuji self-assigned this Jun 23, 2023
@climbfuji climbfuji added the INFRA label (JEDI Infrastructure) Jun 23, 2023
@BenjaminTJohnson

I guess my question is why this is only occurring now, if it has been running fine on Hercules. Or is this the first time it's being run on Hercules?

@jswhit
Author

jswhit commented Jun 23, 2023

I believe I may be the first one to try running GSI on hercules.

@jswhit
Author

jswhit commented Jun 23, 2023

> That's good to know. I have seen errors (segfaults) related to CRTM on Hercules in the JEDI CI tests as well. I suspect it has to do with the newest Intel compilers being used (most likely hinting at a bug in the CRTM code).

Note that the same version of the intel compiler is being used in spack-stack 1.4.0 on orion and hercules, and it only crashes on hercules.

Correction: Looks like 2022.1.2 on orion and 2022.2.1 on hercules.
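
Version mismatches like this are easy to miss when reading two `module list` outputs side by side. Here is a minimal sketch of a helper that diffs them, assuming the outputs are pasted in as plain text. `parse_module_list` and `diff_modules` are hypothetical names, not part of Lmod or spack-stack, and the sample strings are abbreviated from the lists earlier in this issue:

```python
import re

# Matches entries such as "12) netcdf-c/4.9.2" in `module list` output.
MODULE_RE = re.compile(r"\d+\)\s+([^\s/]+)/(\S+)")


def parse_module_list(text):
    """Return {module name: version} parsed from pasted `module list` output."""
    return {name: version for name, version in MODULE_RE.findall(text)}


def diff_modules(a, b):
    """Return {name: (version_a, version_b)} for modules that differ.

    A module present on only one side is reported with None on the other.
    """
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Abbreviated samples taken from the module lists in this issue.
orion = parse_module_list(
    "1) intel/2022.1.2   5) mkl/2020.2   13) bufr/11.7.1   27) git-lfs/2.12.0"
)
hercules = parse_module_list(
    "1) intel-oneapi-compilers/2022.2.1   5) intel-oneapi-mkl/2022.2.1"
    "   13) bufr/11.7.1   26) git-lfs/3.1.2"
)

for name, (v_orion, v_hercules) in diff_modules(orion, hercules).items():
    print(f"{name}: orion={v_orion} hercules={v_hercules}")
```

Modules that are named differently on the two systems (for example `intel` versus `intel-oneapi-compilers`) show up as present on only one side, so those pairs still need to be matched by eye.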

@climbfuji
Collaborator

climbfuji commented Jun 23, 2023

That's good to know. Hercules also has a much newer OS with many newer libraries (and possibly a newer GCC/G++ backend) than Orion. We'll have to keep digging.

@climbfuji
Collaborator

We installed a new version of the stack using the latest Intel compilers. Can you try that, please?

module purge
module use /work/noaa/epic/role-epic/spack-stack/hercules/modulefiles
module load ecflow/5.8.4
module load mysql/8.0.31

# For spack-stack-dev-20230717 with Intel, load the following modules after loading miniconda and ecflow:

module use /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-dev-20230717/envs/unified-env/install/modulefiles/Core
module load stack-intel/2021.9.0
module load stack-intel-oneapi-mpi/2021.9.0
module load stack-python/3.9.14
module available

@jswhit
Author

jswhit commented Jul 24, 2023

Just tried to compile GSI with that, but it's missing bufr_d (only bufr_4 in /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-dev-20230717/envs/unified-env/install/intel/2021.9.0/bufr-12.0.0-kfamcwl/include/)
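
A quick way to check which bufr builds an installation provides before starting a compile is to list the `bufr_*` entries under its include directory. This is only a sketch, assuming each build appears as a subdirectory named `bufr_<kind>` (as with `bufr_4` above). The helper names and the default `required` tuple are hypothetical, and the demo uses a temporary directory instead of the real install path:

```python
import tempfile
from pathlib import Path


def installed_bufr_variants(include_dir):
    """List bufr_<kind> subdirectories found under a bufr include directory."""
    root = Path(include_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir() if p.is_dir() and p.name.startswith("bufr_")
    )


def missing_bufr_variants(include_dir, required=("bufr_4", "bufr_d")):
    """Return the required variants that are not present in include_dir."""
    found = set(installed_bufr_variants(include_dir))
    return [name for name in required if name not in found]


# Demo on a throwaway directory that mimics an install providing only bufr_4.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bufr_4").mkdir()
    print("installed:", installed_bufr_variants(tmp))
    print("missing:", missing_bufr_variants(tmp))
```

Pointing `include_dir` at the bufr `include/` directory of the stack being tested would report `bufr_d` as missing in the situation described here.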

@climbfuji
Collaborator

GSI will need to be updated to work with bufr@12; the bufr@12 release notes describe how this can be done. In the meantime, an older version of bufr can be installed, and already is - see #687. Please check if this works for you.

@climbfuji
Collaborator

Once spack-stack 1.5.0 is installed on Hercules and we have confirmation that GSI works, we can close this issue as completed.

@climbfuji
Collaborator

@jswhit spack-stack-1.5.0 is available on Hercules. Would you mind testing GSI on the system with Intel and GNU? Thanks!

@climbfuji
Collaborator

We haven't heard back on whether testing with spack-stack 1.5.0 was successful. Closing this issue; if there are still problems with newer spack-stack versions, please open a new issue.

@climbfuji climbfuji closed this as not planned Oct 11, 2023