Fatal failure while writing restart files with fine dimensionality #613
Comments
@JoshuaRady, is there anything informative in the cesm or lnd run logs in the run directory?
The stack trace above is from the CESM log of the crashing process (9 in this case). The land logs all end wherever they happened to be, with no error messages.
Digging back through the history of this issue, I (re)found a case where the CESM log file provided a more informative stack trace. It makes clear what is happening, but I don't know why it only happens with some instances and not others:
340268 4: NetCDF: Index exceeds dimension bound
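Since the informative message only shows up for some instances, it may help to scan the whole CESM log for NetCDF errors and group them by rank. The following is a minimal sketch, not part of CIME or FATES. It assumes each log line looks like the excerpts in this issue (an optional leading line number, then the rank, a colon, and the message); the function name `netcdf_errors_by_rank` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed line shape, based on the excerpts quoted in this issue:
#   "<optional line number> <rank>:<message>"
# e.g. "340268 4: NetCDF: Index exceeds dimension bound"
LINE_RE = re.compile(r"^\s*(?:(\d+)\s+)?(\d+):\s*(.*)$")


def netcdf_errors_by_rank(lines):
    """Group messages containing 'NetCDF:' by the rank that printed them."""
    errors = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # line does not follow the assumed shape
        _, rank, message = match.groups()
        if "NetCDF:" in message:
            errors[int(rank)].append(message.strip())
    return dict(errors)


sample = [
    "340268 4: NetCDF: Index exceeds dimension bound",
    "260558 9:MPT: #6  0x000000000081844a in fatesrestartinterfacemod_mp_set_restart_vectors_ ()",
]
print(netcdf_errors_by_rank(sample))
```

To run it on a real log, pass an open file handle instead of `sample`; the regular expression would need adjusting if the log prefix differs from the assumed shape.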
When running CLM-FATES version fates_s1.31.0_api.8.0.0 with an increased number of size and height bins (fates_history_sizeclass_bin_edges and fates_history_height_bin_edges, 302 values each), simulations sometimes crash while writing the restart files.
Whether or not a crash occurs depends in some way on parameter file differences. I have not been able to determine which parameters are associated with the failure, because I am running a single point in multi-instance mode and the offending process causes some other simulations to terminate early, before they try to write their restart files. Most simulations that do finish write their restart files successfully.
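One way to narrow down the parameters involved is to compare only the instances whose outcome is known: the one identified as crashing and the ones that wrote restart files, ignoring instances that were terminated early. The sketch below is a heuristic under that assumption. The function `candidate_parameters`, the parameter names, and the instance numbers are all hypothetical, and the parameter values are assumed to have already been read into plain dictionaries (arrays stored as tuples).

```python
def candidate_parameters(params_by_instance, crashed, succeeded):
    """Return parameter names whose values separate crashed from successful instances.

    A parameter is a candidate when none of the values it takes in a crashed
    instance also occurs in an instance that wrote its restart file.
    Instances in neither set (e.g. terminated early) are ignored.
    """
    names = set()
    for inst in crashed | succeeded:
        names.update(params_by_instance[inst])

    candidates = []
    for name in sorted(names):
        bad = {params_by_instance[i].get(name) for i in crashed}
        good = {params_by_instance[i].get(name) for i in succeeded}
        if bad and good and bad.isdisjoint(good):
            candidates.append(name)
    return candidates


# Hypothetical example: instance 9 crashed, instances 1 and 2 finished.
params = {
    1: {"param_a": 0.5, "param_b": (1.0, 2.0)},
    2: {"param_a": 0.7, "param_b": (1.0, 2.0)},
    9: {"param_a": 0.5, "param_b": (3.0, 4.0)},
}
print(candidate_parameters(params, crashed={9}, succeeded={1, 2}))
```

With a single known crashing instance this only yields a list of suspects, not a cause; more crashing cases would be needed to shrink it.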
These are one-off simulations, so I have been working around the problem by changing the XML setting to REST_OPTION=never. However, I imagine this could become a problem in the future. This seems like a low-priority issue for the community, but I wanted people to be aware of it.
An example stack trace, which I don't find very informative, is:
...
260551 9:MPT: #2 MPI_SGI_stacktraceback (
260552 9:MPT: header=header@entry=0x7ffcf009da40 "MPT ERROR: Rank 9(g:9) received signal SIGSEGV(11).\n\tProcess ID: 24926, Host: r11i4n21, Program: /glade/scratch/jmrady/FATES_VTSpacingTrial_HalifaxCoNC_AllPlots_LLpftP_2/bld/cesm.exe\n\tMPT Version: HPE"...) at sig.c:340
260553 9:MPT: #3 0x00002b68eed07fb2 in first_arriver_handler (signo=signo@entry=11,
260554 9:MPT: stack_trace_sem=stack_trace_sem@entry=0x2b68f9300080) at sig.c:489
260555 9:MPT: #4 0x00002b68eed0834b in slave_sig_handler (signo=11,
260556 9:MPT: siginfo=, extra=) at sig.c:564
260557 9:MPT: #5
260558 9:MPT: #6 0x000000000081844a in fatesrestartinterfacemod_mp_set_restart_vectors_ ()
260559 9:MPT: at /glade/work/jmrady/ClmVersions/fates_s1.31.0_api.8.0.0/cime/../src/fates/main/FatesRestartInterfaceMod.F90:1808
260560 9:MPT: #7 0x0000000000521981 in clmfatesinterfacemod_mp_restart_ ()
...