WE2E test "grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16" fails with segmentation fault at run_fcst step #731

Closed
mkavulich opened this issue Apr 6, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@mkavulich
Collaborator

Description

The WE2E test "grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16" currently fails with the most up-to-date hashes for all UFS components (ufs-community/ufs-srweather-app@2b09220).

Steps to Reproduce

  1. On Hera, build the latest UFS_SRW_App and run the grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16 WE2E test.
  2. Observe the failure at the run_fcst step.

Additional Context

This failure has been observed on Hera and Orion, but likely occurs on all platforms.

Output

For me, the failure manifested as a segmentation fault with no helpful error message:

0 FORECAST DATE          20 MAY   2019 AT  1 HRS  0.00 MINS
  JULIAN DAY             2458623  PLUS   0.541667
  RADIUS VECTOR          1.0117708
  RIGHT ASCENSION OF SUN   3.7696072 HRS, OR   3 HRS  46 MINS  10.6 SECS
  DECLINATION OF THE SUN  19.8837416 DEGS, OR   19 DEGS  53 MINS   1.5 SECS
  EQUATION OF TIME         3.4600539 MINS, OR    207.60 SECS, OR 0.015139 RADIANS
  SOLAR CONSTANT        1329.5560131 (DISTANCE AJUSTED)


    for cosz calculations: nswr,deltim,deltsw,dtswh =          90
   40.0000000000000        3600.00000000000        1.00000000000000
   anginc,nstp =  2.908882086657215E-003          90
 PASS: fcstRUN phase 1, n_atmsteps =           90  time is
   3.49750709533691
 PASS: fcstRUN phase 2, n_atmsteps =           90  time is
  0.369647979736328
 PASS: fcstRUN phase 1, n_atmsteps =           91  time is
   1.88833403587341
 PASS: fcstRUN phase 2, n_atmsteps =           91  time is
  0.134248018264771
 PASS: fcstRUN phase 1, n_atmsteps =           92  time is
   1.89364600181580
 PASS: fcstRUN phase 2, n_atmsteps =           92  time is
  0.130100011825562
 PASS: fcstRUN phase 1, n_atmsteps =           93  time is
   1.88795089721680
 PASS: fcstRUN phase 2, n_atmsteps =           93  time is
  0.134515047073364
 PASS: fcstRUN phase 1, n_atmsteps =           94  time is
   1.88999199867249
 PASS: fcstRUN phase 2, n_atmsteps =           94  time is
  0.138705968856812
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
srun: error: h17c50: tasks 3-4,9: Exited with exit code 174
srun: launch/slurm: _step_signal: Terminating StepId=30229343.0
slurmstepd: error: *** STEP 30229343.0 ON h17c50 CANCELLED AT 2022-04-05T00:08:08 ***
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
ufs_model          000000000406E26F  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B32B316A630  Unknown               Unknown  Unknown
libmpi.so.12       00002B32B2602D0C  PMPIDI_CH3I_Progr     Unknown  Unknown
libmpi.so.12.0     00002B32B28418FD  Unknown               Unknown  Unknown
libmpi.so.12       00002B32B28F2C42  PMPI_Probe            Unknown  Unknown
ufs_model          000000000082F20A  _ZN5ESMCI3VMK4rec        4485  ESMCI_VMKernel.C
ufs_model          0000000000F44479  _ZN5ESMCI3XXE4exe        4085  ESMCI_DELayout.C
ufs_model          0000000000F42BE8  _ZN5ESMCI3XXE4exe        5409  ESMCI_DELayout.C
ufs_model          0000000001360FC6  _ZN5ESMCI11ArrayB        1680  ESMCI_ArrayBundle.C
ufs_model          0000000000A3CFB2  c_esmc_arraybundl         717  ESMCI_ArrayBundle_F.C
ufs_model          0000000000755372  esmf_arraybundlem        2945  ESMF_ArrayBundle.F90
ufs_model          000000000071ADCA  esmf_fieldbundlem       16641  ESMF_FieldBundle.F90
ufs_model          000000000071A750  esmf_fieldbundlem       15390  ESMF_FieldBundle.F90
ufs_model          0000000001AD74FF  fv3gfs_cap_mod_mp         922  fv3_cap.F90
ufs_model          0000000001AD6101  fv3gfs_cap_mod_mp         804  fv3_cap.F90

The traceback continues beyond this point. The full log file for this failing test can be found on Hera at /scratch2/BMC/det/kavulich/workdir/update_app_hashes/expt_dirs/grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16/log/run_fcst_2019052000.log

@chan-hoo reports that this failure can also show the following error message prior to the segfault:

FATAL from PE 4: compute_qs: saturation vapor pressure table overflow, nbad= 1
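
To help triage logs like the one above, here is a minimal sketch of a scanner that reports the last completed step and collects any fatal lines. It is not part of the SRW App or its workflow; the function name `summarize_fcst_log` and the `deltim_s` parameter are hypothetical. It assumes the `deltim` value of 40 s printed in the "for cosz calculations" line, and that `n_atmsteps` times `deltim` approximates the elapsed forecast time.

```python
import re

# Matches lines such as " PASS: fcstRUN phase 2, n_atmsteps =           94  time is"
STEP_RE = re.compile(r"PASS: fcstRUN phase (\d), n_atmsteps =\s+(\d+)")

# Substrings taken from the fatal messages quoted in this issue
FATAL_MARKERS = ("forrtl: severe", "FATAL from PE")


def summarize_fcst_log(lines, deltim_s=40.0):
    """Return (last completed step, elapsed forecast seconds, fatal lines).

    deltim_s is the model time step in seconds; 40 s is an assumption
    read from the log excerpt, not from the experiment configuration.
    """
    last_step = None
    fatal = []
    for line in lines:
        match = STEP_RE.search(line)
        # A step is treated as complete once phase 2 has reported PASS
        if match and match.group(1) == "2":
            last_step = int(match.group(2))
        if any(marker in line for marker in FATAL_MARKERS):
            fatal.append(line.strip())
    elapsed = None if last_step is None else last_step * deltim_s
    return last_step, elapsed, fatal


# Sample lines copied from the output quoted in this issue
sample = [
    " PASS: fcstRUN phase 1, n_atmsteps =           94  time is",
    "   1.88999199867249",
    " PASS: fcstRUN phase 2, n_atmsteps =           94  time is",
    "  0.138705968856812",
    "forrtl: severe (174): SIGSEGV, segmentation fault occurred",
    "FATAL from PE 4: compute_qs: saturation vapor pressure table overflow, nbad= 1",
]

step, elapsed, fatal = summarize_fcst_log(sample)
print(step, elapsed, len(fatal))
```

For a real run, the same function could be called on an open file handle for the run_fcst log. Under the assumptions above, step 94 corresponds to 3760 s of forecast time, which places the crash shortly after the first forecast hour.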