Fatal failure while writing restart files with fine dimensionality #613

Open

JoshuaRady opened this issue Feb 20, 2020 · 3 comments

@JoshuaRady
When running CLM-FATES version fates_s1.31.0_api.8.0.0 with an increased number of size bins (fates_history_sizeclass_bin_edges and fates_history_height_bin_edges, 302 values each), simulations sometimes crash while writing their restart files.
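For reference, a minimal sketch of how the modified bin-edge arrays can be sanity-checked in the FATES parameter file, assuming Python's netCDF4 module; the file name is hypothetical, the variable names are the ones cited above:

```python
import numpy as np
import netCDF4

# Hypothetical file name; substitute the actual modified FATES parameter file.
with netCDF4.Dataset("fates_params_302bins.nc") as params:
    size_edges = params.variables["fates_history_sizeclass_bin_edges"][:]
    height_edges = params.variables["fates_history_height_bin_edges"][:]

    # Both arrays were extended to 302 edges in the runs described above.
    print(len(size_edges), len(height_edges))

    # Bin edges should be strictly increasing, or the size/height binning is ill-defined.
    assert np.all(np.diff(size_edges) > 0)
    assert np.all(np.diff(height_edges) > 0)
```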

Whether or not a crash occurs depends in some way on differences in the parameter files. I have not been able to determine which parameters are associated with the failure because I am running a single point in multi-instance mode, and the offending process causes some of the other simulations to terminate early before they try to write their restart files. Most simulations that do finish write their restart files successfully.

These are one-off simulations, so I have been working around the problem by setting the XML variable REST_OPTION=never. However, I imagine this could present a problem in the future. This seems like a low-priority issue for the community, but I wanted people to be aware of it.

An example stack trace, which I don't find very informative, is:
...
260551 9:MPT: #2 MPI_SGI_stacktraceback (
260552 9:MPT: header=header@entry=0x7ffcf009da40 "MPT ERROR: Rank 9(g:9) received signal SIGSEGV(11).\n\tProcess ID: 24926, Host: r11i4n21, Program: /glade/scratch/jmrady/FATES_VTSpacingTrial_HalifaxCoNC_AllPlots_LLpftP_2/bld/cesm.exe\n\tMPT Version: HPE"...) at sig.c:340
260553 9:MPT: #3 0x00002b68eed07fb2 in first_arriver_handler (signo=signo@entry=11,
260554 9:MPT: stack_trace_sem=stack_trace_sem@entry=0x2b68f9300080) at sig.c:489
260555 9:MPT: #4 0x00002b68eed0834b in slave_sig_handler (signo=11,
260556 9:MPT: siginfo=, extra=) at sig.c:564
260557 9:MPT: #5
260558 9:MPT: #6 0x000000000081844a in fatesrestartinterfacemod_mp_set_restart_vectors_ ()
260559 9:MPT: at /glade/work/jmrady/ClmVersions/fates_s1.31.0_api.8.0.0/cime/../src/fates/main/FatesRestartInterfaceMod.F90:1808
260560 9:MPT: #7 0x0000000000521981 in clmfatesinterfacemod_mp_restart_ ()
...

@rgknox (Contributor) commented Feb 20, 2020

@JoshuaRady is there anything informative in the cesm or lnd run logs in the run directory?

@JoshuaRady (Author)

The stack trace above is from the CESM log of the crashing process (rank 9 in this case). The land logs all simply stop wherever they happened to be, with no error messages.

@JoshuaRady (Author)

Digging back through the history of this issue, I (re)found a case where the CESM log file provided a more informative stack trace. It makes clear what is happening, but I don't know why it only happens with some instances and not others.

340268 4: NetCDF: Index exceeds dimension bound
340269 4: pio_support::pio_die:: myrank= -1 : ERROR:
340270 4: pionfwrite_mod::write_nfdarray_int: 250 :
340271 4: NetCDF: Index exceeds dimension bound
340272 4:Image PC Routine Line Source
340273 4:cesm.exe 00000000015C74FD Unknown Unknown Unknown
340274 4:cesm.exe 0000000000EE6191 pio_support_mp_pi 118 pio_support.F90
340275 4:cesm.exe 0000000000EE43BE pio_utils_mp_chec 74 pio_utils.F90
340276 4:cesm.exe 0000000000FE95FA pionfwrite_mod_mp 250 pionfwrite_mod.F90.in
340277 4:cesm.exe 0000000000FAF46F piodarray_mp_writ 650 piodarray.F90.in
340278 4:cesm.exe 0000000000FB17C4 piodarray_mp_writ 221 piodarray.F90.in
340279 4:cesm.exe 00000000005E887C ncdio_pio_mp_ncd_ 1657 ncdio_pio.F90.in
340280 4:cesm.exe 0000000000618C81 restutilmod_mp_re 344 restUtilMod.F90.in
340281 4:cesm.exe 000000000052221F clmfatesinterface 1103 clmfates_interfaceMod.F90
340282 4:cesm.exe 0000000000509429 clm_instmod_mp_cl 543 clm_instMod.F90
340283 4:cesm.exe 000000000060BDE6 restfilemod_mp_re 119 restFileMod.F90
340284 4:cesm.exe 00000000004FEC23 clm_driver_mp_clm 1168 clm_driver.F90
340285 4:cesm.exe 00000000004EBFD0 lnd_comp_mct_mp_l 451 lnd_comp_mct.F90
340286 4:cesm.exe 0000000000425A58 component_mod_mp_ 724 component_mod.F90
340287 4:cesm.exe 0000000000409C2A cime_comp_mod_mp_ 2447 cime_comp_mod.F90
340288 4:cesm.exe 00000000004256EC MAIN__ 133 cime_driver.F90
340289 4:cesm.exe 0000000000407EDE Unknown Unknown Unknown
340290 4:libc.so.6 00002B6D6D07B6E5 __libc_start_main Unknown Unknown
340291 4:cesm.exe 0000000000407DE9 Unknown Unknown Unknown
340292 4:MPT ERROR: Rank 4(g:4) is aborting with error code 1.
340293 4: Process ID: 12893, Host: r14i7n7, Program: /glade/scratch/jmrady/FATES_VTSpacingTrial_KingAndQueen_AllPlots_LLpftP_2/bld/cesm.exe
340294 4: MPT Version: HPE MPT 2.19 02/23/19 05:30:09
340295 4:
340296 4:MPT: --------stack traceback-------
