VIC test is failing at f09 resolution with signal #384

Closed
ekluzek opened this issue May 16, 2018 · 4 comments

ekluzek commented May 16, 2018

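In clm5.0.dev011 the ERP_D_Ld5.f09_g17.I2000Clm50Vic.cheyenne_intel.clm-vrtlay test started failing with a signal trap (floating point error or a memory error?). The traceback below reports SIGFPE(8) on rank 37, raised in initsoilparvic at SoilHydrologyInitTimeConstMod.F90:426.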

I ran SMS tests at the lower resolutions f10_g37 and f19_g17 successfully, so this may be an issue at this specific resolution. This tag did lower the number of processors used by default for f09.

37:MPT: Missing separate debuginfos, use: zypper install glibc-debuginfo-2.19-35.1.x86_64
37:MPT: (gdb) #0  0x00002aaaafac141c in waitpid ()
37:MPT:    from /glade/u/apps/ch/os/lib64/libpthread.so.0
37:MPT: #1  0x00002aaab01f2576 in mpi_sgi_system (
37:MPT: #2  MPI_SGI_stacktraceback (
37:MPT:     header=header@entry=0x7ffffffea540 'MPT ERROR: Rank 37(g:37) received signal SIGFPE(8).\n\tProcess ID: 13692, Host: r1i3n6, Program: /glade2/scratch2/erik/ERP_D_Ld5.f09_g17.I2000Clm50Vic.cheyenne_intel.clm-vrtlay.GC.clm50dev011chintela/bl'...) at sig.c:339
37:MPT: #3  0x00002aaab01f2778 in first_arriver_handler (signo=signo@entry=8, 
37:MPT:     stack_trace_sem=stack_trace_sem@entry=0x2aaabaac0500) at sig.c:488
37:MPT: #4  0x00002aaab01f2b5b in slave_sig_handler (signo=8, siginfo=<optimized out>, 
37:MPT:     extra=<optimized out>) at sig.c:563
37:MPT: #5  <signal handler called>
37:MPT: #6  0x0000000001fca8c3 in soilhydrologyinittimeconstmod::initsoilparvic (c=79, 
37:MPT:     claycol=0x2aac83a02900, sandcol=0x2aab9fb848c0, om_fraccol=0x2aac83a06940, 
37:MPT:     soilhydrology_inst=...)
37:MPT:     at /glade/p/work/erik/ctsm/src/biogeophys/SoilHydrologyInitTimeConstMod.F90:426
37:MPT: #7  0x0000000001fc3ccb in soilhydrologyinittimeconstmod::soilhydrologyinittimeconst (bounds=..., soilhydrology_inst=...)
37:MPT:     at /glade/p/work/erik/ctsm/src/biogeophys/SoilHydrologyInitTimeConstMod.F90:307
37:MPT: #8  0x00000000008bd8ab in clm_instmod::clm_instinit (bounds=...)
37:MPT:     at /glade/p/work/erik/ctsm/src/main/clm_instMod.F90:305
37:MPT: #9  0x00000000008b59ba in clm_initializemod::initialize2 ()
37:MPT:     at /glade/p/work/erik/ctsm/src/main/clm_initializeMod.F90:434
37:MPT: #10 0x000000000084667d in lnd_comp_mct::lnd_init_mct (eclock=..., cdata_l=..., 
37:MPT:     x2l_l=..., l2x_l=..., nlfilename=..., .tmp.NLFILENAME.len_V$50db=6)
37:MPT:     at /glade/p/work/erik/ctsm/src/cpl/lnd_comp_mct.F90:233
37:MPT: #11 0x0000000000453f8d in component_mod::component_init_cc (eclock=..., 
37:MPT:     comp=..., infodata=..., nlfilename=..., 
37:MPT:     seq_flds_x2c_fluxes=<error reading variable: Cannot access memory at address 0x0>, 
37:MPT:     seq_flds_c2x_fluxes=<error reading variable: virtual memory exhausted: can't allocate 140737488307416 bytes.>, .tmp.NLFILENAME.len_V$2fc9=6, 
37:MPT:     .tmp.SEQ_FLDS_X2C_FLUXES.len_V$2fcc=0, 
37:MPT:     .tmp.SEQ_FLDS_C2X_FLUXES.len_V$2fcf=0)
37:MPT:     at /glade/p/work/erik/ctsm/cime/src/drivers/mct/main/component_mod.F90:267
37:MPT: #12 0x000000000041a434 in cime_comp_mod::cime_init ()
37:MPT:     at /glade/p/work/erik/ctsm/cime/src/drivers/mct/main/cime_comp_mod.F90:1181
37:MPT: #13 0x0000000000449ea1 in cime_driver ()
37:MPT:     at /glade/p/work/erik/ctsm/cime/src/drivers/mct/main/cime_driver.F90:92
37:MPT: #14 0x000000000040899e in main ()
37:MPT: #15 0x00002aaab04dcb25 in __libc_start_main ()
37:MPT:    from /glade/u/apps/ch/os/lib64/libc.so.6
37:MPT: #16 0x00000000004088a9 in _start () at ../sysdeps/x86_64/start.S:122
37:MPT: (gdb) A debugging session is active.
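
For anyone skimming a trace like the one above: the frame that matters is usually the first one below `<signal handler called>`, since everything above it is the MPT signal-handling machinery. A minimal sketch of pulling that frame out of a log, assuming the `NN:MPT:` prefix and gdb frame layout shown here (the function name and regexes are hypothetical helpers, not part of CIME or CTSM):

```python
import re

# Matches a gdb frame header such as "#6  0x... in module::routine (" after
# the "NN:MPT: " prefix that appears on each line of the trace above.
FRAME_RE = re.compile(r"^\d+:MPT:\s+#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+)")
# Matches the source location gdb prints, e.g. "at /path/File.F90:426".
LOCATION_RE = re.compile(r"\bat (\S+):(\d+)\s*$")


def first_frame_after_signal(lines):
    """Return (frame_number, routine, source_file, line_number) for the first
    frame below '<signal handler called>', or None if no such frame exists."""
    seen_handler = False
    current = None  # (frame_number, routine) of the frame being read
    for line in lines:
        if "<signal handler called>" in line:
            seen_handler = True
            continue
        if not seen_handler:
            continue
        frame = FRAME_RE.match(line)
        if frame:
            current = (int(frame.group(1)), frame.group(2))
        location = LOCATION_RE.search(line)
        if current and location:
            return current + (location.group(1), int(location.group(2)))
    return None


# A few lines copied from the trace in this issue.
TRACE_EXCERPT = [
    "37:MPT: #4  0x00002aaab01f2b5b in slave_sig_handler (signo=8, siginfo=<optimized out>, ",
    "37:MPT:     extra=<optimized out>) at sig.c:563",
    "37:MPT: #5  <signal handler called>",
    "37:MPT: #6  0x0000000001fca8c3 in soilhydrologyinittimeconstmod::initsoilparvic (c=79, ",
    "37:MPT:     claycol=0x2aac83a02900, sandcol=0x2aab9fb848c0, om_fraccol=0x2aac83a06940, ",
    "37:MPT:     soilhydrology_inst=...)",
    "37:MPT:     at /glade/p/work/erik/ctsm/src/biogeophys/SoilHydrologyInitTimeConstMod.F90:426",
]

if __name__ == "__main__":
    print(first_frame_after_signal(TRACE_EXCERPT))
```

On the excerpt this picks out frame #6, `initsoilparvic`, at `SoilHydrologyInitTimeConstMod.F90` line 426, which is where the SIGFPE was raised for column `c=79`.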

ekluzek self-assigned this May 16, 2018

billsacks commented

This passed with the cime update to the latest master. I'll close this if it continues to pass.

billsacks commented

This passed again in my rerun of the test suite. I'll close it when I bring my cime-update branch to master.

ekluzek added a commit that referenced this issue Sep 8, 2018
Update cime to cime5.7.3

Update cime from cime5.6.10 to cime5.7.3. To support this change, there
are also minor code changes related to the pause-resume implementation
(from Erik Kluzek).

Fixes #384
lawrencepj1 pushed a commit to lawrencepj1/ctsm that referenced this issue Sep 22, 2018
Update cime to cime5.7.3


billsacks commented

This is failing for me on the release-clm5.0 branch in the same way as noted in the original issue. It failed in my testing for release-clm5.0.10 (release-clm5.0.09-43-gde4a13456), but it also failed when I checked out a fresh copy of release-clm5.0.09 and reran it from there.

I haven't had it fail for me on master since this issue was closed, so I'm tentatively still thinking that this is fixed on master. But I'm going to add it back to the ExpectedFails list on the release branch.

billsacks added a commit to billsacks/ctsm that referenced this issue Feb 22, 2019
Update cime to cime5.7.3

billsacks pushed a commit to billsacks/ctsm that referenced this issue Feb 22, 2019
Update cime to cime5.7.3


ekluzek commented Jun 7, 2019

It looks like this is actually working of late on the release branch. It failed in release-clm5.0.10 (at least output files weren't created), and it failed before release-clm5.0.04. But there's been a long string of it just running fine.
