Fatal failure while writing restart files with fine dimensionality #613

Open

JoshuaRady opened this issue Feb 20, 2020 · 3 comments

@JoshuaRady
When running CLM-FATES version fates_s1.31.0_api.8.0.0 with an increased number of size bins (fates_history_sizeclass_bin_edges and fates_history_height_bin_edges, 302 values each), simulations sometimes crash while writing their restart files.
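For reference, a minimal sketch of how the modified bin-edge arrays can be sanity-checked in the FATES parameter file, assuming Python's netCDF4 module; the file name is hypothetical, the variable names are the ones cited above:

```python
import numpy as np
import netCDF4

# Hypothetical file name; substitute the actual modified FATES parameter file.
with netCDF4.Dataset("fates_params_302bins.nc") as params:
    size_edges = params.variables["fates_history_sizeclass_bin_edges"][:]
    height_edges = params.variables["fates_history_height_bin_edges"][:]

    # Both arrays were extended to 302 edges in the runs described above.
    print(len(size_edges), len(height_edges))

    # Bin edges should be strictly increasing, or the size/height binning is ill-defined.
    assert np.all(np.diff(size_edges) > 0)
    assert np.all(np.diff(height_edges) > 0)
```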

Whether or not a crash occurs depends in some way on differences in the parameter files. I have not been able to determine which parameters are associated with the failure because I am running a single point in multi-instance mode, and the offending process causes some of the other simulations to terminate early before they try to write their restart files. Most simulations that do finish write their restart files successfully.

These are one-off simulations, so I have been working around the problem by setting the XML variable REST_OPTION=never. However, I imagine this could present a problem in the future. This seems like a low-priority issue for the community, but I wanted people to be aware of it.

An example stack trace, which I don't find very informative, is:
...
260551 9:MPT: #2 MPI_SGI_stacktraceback (
260552 9:MPT: header=header@entry=0x7ffcf009da40 "MPT ERROR: Rank 9(g:9) received signal SIGSEGV(11).\n\tProcess ID: 24926, Host: r11i4n21, Program: /glade/scratch/jmrady/FATES_VTSpacingTrial_HalifaxCoNC_AllPlots_LLpftP_2/bld/cesm.exe\n\tMPT Version: HPE"...) at sig.c:340
260553 9:MPT: #3 0x00002b68eed07fb2 in first_arriver_handler (signo=signo@entry=11,
260554 9:MPT: stack_trace_sem=stack_trace_sem@entry=0x2b68f9300080) at sig.c:489
260555 9:MPT: #4 0x00002b68eed0834b in slave_sig_handler (signo=11,
260556 9:MPT: siginfo=, extra=) at sig.c:564
260557 9:MPT: #5
260558 9:MPT: #6 0x000000000081844a in fatesrestartinterfacemod_mp_set_restart_vectors_ ()
260559 9:MPT: at /glade/work/jmrady/ClmVersions/fates_s1.31.0_api.8.0.0/cime/../src/fates/main/FatesRestartInterfaceMod.F90:1808
260560 9:MPT: #7 0x0000000000521981 in clmfatesinterfacemod_mp_restart_ ()
...

@rgknox (Contributor) commented Feb 20, 2020

@JoshuaRady is there anything informative in the cesm or lnd run logs in the run directory?

@JoshuaRady (Author)

The stack trace above is from the CESM log of the crashing process (rank 9 in this case). The land logs all simply stop wherever they happened to be, with no error messages.

@JoshuaRady (Author)

Digging back through the history of this issue, I (re)found a case where the CESM log file provided a more informative stack trace. It makes clear what is happening, but I don't know why it only happens with some instances and not others.

340268 4: NetCDF: Index exceeds dimension bound
340269 4: pio_support::pio_die:: myrank= -1 : ERROR:
340270 4: pionfwrite_mod::write_nfdarray_int: 250 :
340271 4: NetCDF: Index exceeds dimension bound
340272 4:Image PC Routine Line Source
340273 4:cesm.exe 00000000015C74FD Unknown Unknown Unknown
340274 4:cesm.exe 0000000000EE6191 pio_support_mp_pi 118 pio_support.F90
340275 4:cesm.exe 0000000000EE43BE pio_utils_mp_chec 74 pio_utils.F90
340276 4:cesm.exe 0000000000FE95FA pionfwrite_mod_mp 250 pionfwrite_mod.F90.in
340277 4:cesm.exe 0000000000FAF46F piodarray_mp_writ 650 piodarray.F90.in
340278 4:cesm.exe 0000000000FB17C4 piodarray_mp_writ 221 piodarray.F90.in
340279 4:cesm.exe 00000000005E887C ncdio_pio_mp_ncd_ 1657 ncdio_pio.F90.in
340280 4:cesm.exe 0000000000618C81 restutilmod_mp_re 344 restUtilMod.F90.in
340281 4:cesm.exe 000000000052221F clmfatesinterface 1103 clmfates_interfaceMod.F90
340282 4:cesm.exe 0000000000509429 clm_instmod_mp_cl 543 clm_instMod.F90
340283 4:cesm.exe 000000000060BDE6 restfilemod_mp_re 119 restFileMod.F90
340284 4:cesm.exe 00000000004FEC23 clm_driver_mp_clm 1168 clm_driver.F90
340285 4:cesm.exe 00000000004EBFD0 lnd_comp_mct_mp_l 451 lnd_comp_mct.F90
340286 4:cesm.exe 0000000000425A58 component_mod_mp_ 724 component_mod.F90
340287 4:cesm.exe 0000000000409C2A cime_comp_mod_mp_ 2447 cime_comp_mod.F90
340288 4:cesm.exe 00000000004256EC MAIN__ 133 cime_driver.F90
340289 4:cesm.exe 0000000000407EDE Unknown Unknown Unknown
340290 4:libc.so.6 00002B6D6D07B6E5 __libc_start_main Unknown Unknown
340291 4:cesm.exe 0000000000407DE9 Unknown Unknown Unknown
340292 4:MPT ERROR: Rank 4(g:4) is aborting with error code 1.
340293 4: Process ID: 12893, Host: r14i7n7, Program: /glade/scratch/jmrady/FATES_VTSpacingTrial_KingAndQueen_AllPlots_LLpftP_2/bld/cesm.exe
340294 4: MPT Version: HPE MPT 2.19 02/23/19 05:30:09
340295 4:
340296 4:MPT: --------stack traceback-------
