Some yellowstone tests that abort in model run do not exit #383

billsacks · 2016-08-11T15:31:47Z (edited by rljacob)

In doing some testing of my new test infrastructure, I introduced a call to endrun in CLM to make sure that an abort is reported correctly. The behavior was not what I expected. I have confirmed that the same behavior exists at the head of master (5df46a2).

Specifically: At least on yellowstone, when the model aborts, the job does not exit from the machine, but instead remains running until it hits its wallclock limit. In addition, the final status is reported as:

PEND LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop RUN

whereas I would have expected:

FAIL LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop RUN

I tested this with LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop with the following code modifications: Either

--- ../../components/clm/src/main/controlMod.F90    2016-08-03 08:10:14.818918163 -0600
+++ LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop.20160811_150527/SourceMods/src.clm/controlMod.F90  2016-08-11 09:07:43.687826068 -0600
@@ -297,6 +297,8 @@

        if (use_init_interp) then
           call apply_use_init_interp(finidat, finidat_interp_source)
+       else
+          call endrun(msg='killing run with init_interp')
        end if

        ! History and restart files

(which forces an abort in the first run of the LII test)

or

--- ../../components/clm/src/main/controlMod.F90    2016-08-03 08:10:14.818918163 -0600
+++ LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop.20160811_150510/SourceMods/src.clm/controlMod.F90  2016-08-11 09:06:53.838141627 -0600
@@ -297,6 +297,7 @@

        if (use_init_interp) then
           call apply_use_init_interp(finidat, finidat_interp_source)
+          call endrun(msg='killing run with init_interp')
        end if

        ! History and restart files

(which forces an abort in the second run of the LII test).

mvertens · 2016-08-11T16:03:02Z

I noticed the same behavior recently - but thought it was a system problem.

jedwards4b · 2016-08-11T16:03:24Z

I tried to reproduce this using SMS_D and introducing a shr_sys_abort call in cesm_comp_mod.F90; this exited correctly and did not wait for the wallclock limit.

billsacks · 2016-08-11T16:43:03Z

@jedwards4b : If I'm looking in the right place, it looks like your run died with:

Exception during run: ERROR: Command: 'TARGET_PROCESSOR_LIST=AUTO_SELECT mpirun.lsf /glade/p/cesmdata/cseg/tools/bin/launch   /glade/scratch/jedwards/SMS_D.f19_g16.X.yellowstone_intel.20160811_153737/bld/cesm.exe  >> cesm.log.$LID 2>&1 ' failed with error '' from dir '/glade/scratch/jedwards/SMS_D.f19_g16.X.yellowstone_intel.20160811_153737/run'

I did one ERS test last night from (I believe) cime5.0.7, and it also exited promptly, with that same error message. No other runs that I have done recently have had that error message, and no other runs I have done recently have exited promptly. I'm not sure if there's a connection.

jedwards4b · 2016-08-11T17:26:38Z

Further testing seems to indicate that this problem is related to the gnu compiler on yellowstone.

billsacks · 2016-08-11T17:33:46Z

@jedwards4b : Thank you very much for your investigation of this. I'm fine with this issue being closed, then, if you are.

The exception error that I referenced above still seems weird, but that's probably a different issue.

gold2718 · 2016-08-11T18:46:18Z

I have seen this behavior using Intel. It seems to happen when only a subset of tasks calls endrun. I have had jobs crash after less than a minute but chew up the whole wallclock allotment (the danger of firing off jobs just before going to bed). I opened an issue with CISL who said:

In this particular case some of the tasks crashed due to code error, I do see "59:forrtl: severe (151): allocatable array is already allocated". The problem is, IBM PE won't exit if some of the tasks exit. This is a legal behavior of their launcher (think of mpi_spawn and exit).
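
To make the launcher behavior described in that reply concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not how IBM PE or mpirun.lsf is actually implemented: the function names (launch, wait_for_all, kill_on_first_failure) are hypothetical, and plain child processes stand in for MPI tasks. One "task" exits with an error immediately while the others keep running; a launcher that waits for every task stays alive as long as the slowest survivor, whereas one that tears everything down on the first non-zero exit returns promptly.

```python
import subprocess
import sys
import time


def launch(n_tasks, failing_rank, survivor_sleep):
    """Start n_tasks child processes: one exits with an error at once, the rest sleep."""
    procs = []
    for rank in range(n_tasks):
        if rank == failing_rank:
            code = "import sys; sys.exit(1)"
        else:
            code = f"import time; time.sleep({survivor_sleep})"
        procs.append(subprocess.Popen([sys.executable, "-c", code]))
    return procs


def wait_for_all(procs):
    """Hypothetical launcher that returns only once every task has exited."""
    return [p.wait() for p in procs]


def kill_on_first_failure(procs, poll_interval=0.05):
    """Hypothetical launcher that kills all remaining tasks once any task exits non-zero."""
    while True:
        codes = [p.poll() for p in procs]
        if any(c not in (None, 0) for c in codes):
            for p in procs:
                if p.poll() is None:
                    p.kill()
            return [p.wait() for p in procs]
        if all(c is not None for c in codes):
            return codes
        time.sleep(poll_interval)


def timed(launcher, **kwargs):
    """Run a launcher over a fresh set of tasks; return (elapsed seconds, exit codes)."""
    start = time.monotonic()
    codes = launcher(launch(**kwargs))
    return time.monotonic() - start, codes


if __name__ == "__main__":
    for launcher in (wait_for_all, kill_on_first_failure):
        elapsed, codes = timed(launcher, n_tasks=3, failing_rank=0, survivor_sleep=3)
        print(f"{launcher.__name__}: {elapsed:.1f}s, exit codes {codes}")
```

In this toy the surviving tasks eventually finish on their own. In a real model run they would more likely block in communication with the dead task, so a wait-for-all launcher would hold the job until the wallclock limit.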

jedwards4b · 2016-08-11T19:06:05Z

@gold2718 I don't think that it's as simple as that. The test that I did above that hung with gnu but not intel was aborting on all tasks. I tried the following simple code with gnu and it does not hang:

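program testgnuabort
  ! Even-rank tasks abort on their sub-communicator;
  ! odd-rank tasks go straight to mpi_finalize.
  use mpi
  implicit none
  intrinsic :: backtrace   ! GNU extension
  integer :: ierr
  integer :: myrank, color, key, mycomm

  call mpi_init(ierr)

  call mpi_comm_rank(MPI_COMM_WORLD, myrank, ierr)

  ! Split MPI_COMM_WORLD into even-rank (color 0) and odd-rank (color 1) communicators
  color = mod(myrank, 2)
  key = myrank/2
  call mpi_comm_split(MPI_COMM_WORLD, color, key, mycomm, ierr)

  if (color == 0) then
     ! Print a stack trace, then abort the even-rank communicator
     call backtrace()

     call mpi_abort(mycomm, -1, ierr)

     ! GNU extension; only reached if mpi_abort returns
     call abort()
  endif

  call mpi_finalize(ierr)

end program testgnuabort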

gold2718 · 2016-08-12T00:30:07Z

@jedwards4b I agree that it is not simple in that different issues happen at different times with different compilers. In addition, the machine's behavior is not consistent (the same executable will sometimes exit and sometimes hang). What I was trying to say in my last message is that it seems to me (i.e., personal anecdotal evidence) that Intel jobs are more likely to hang when not all PEs call endrun.
The fact that jobs can hang even when all tasks call endrun (or abort as in your test job) is disturbing. Have you brought that up with CISL?

jedwards4b · 2016-08-12T02:40:03Z

I did bring it up with CISL, but I couldn't reproduce the problem consistently.

billsacks · 2016-09-29T20:27:21Z

Renaming from "yellowstone-gnu tests that abort in model run do not exit, and are reported as PEND" to simply "yellowstone-gnu tests that abort in model run do not exit"; the reporting as PEND has been moved to a new issue: #610

billsacks · 2016-10-19T02:32:33Z

I just observed this for an intel test: ERP_D_Ld5.f09_g16.ICLM45VIC.yellowstone_intel.clm-vrtlay_interp.GC.1018-1324.45.i

I'm renaming this issue accordingly. (Actually, I guess that's what Steve was saying all along.)

rljacob · 2017-04-07T17:29:53Z

Won't fix since yellowstone is almost retired.

ekluzek · 2017-06-07T17:08:33Z

OK, I saw this same problem with yellowstone_gnu for LII tests. The test runs two cases, and the first aborts with an error, but the job remains in the queue. We decided to close this since it seemed to affect just yellowstone_gnu, so I'll ignore those failures for now. But I'm noting here that this continues to be a problem, seen with LII_D_Ld3.f09_g17.ICLM45BGC.yellowstone_gnu.clm-defaultf09IC using cime5.3.0-alpha.21.

billsacks added the ty: Bug label Aug 11, 2016

rljacob mentioned this issue Aug 11, 2016

Need ability to detect and kill hung jobs #386

Closed

billsacks changed the title ~~Tests that abort in model run do not exit, and are reported as PEND~~ yellowstone-gnu tests that abort in model run do not exit, and are reported as PEND Aug 11, 2016

billsacks removed the ty: Bug label Aug 12, 2016

rljacob added the ty: Bug label Aug 26, 2016

billsacks changed the title ~~yellowstone-gnu tests that abort in model run do not exit, and are reported as PEND~~ yellowstone-gnu tests that abort in model run do not exit Sep 29, 2016

billsacks mentioned this issue Sep 29, 2016

Tests that are running or have hung are reported as PEND #610

Closed

billsacks removed the ty: Bug label Oct 19, 2016

billsacks changed the title ~~yellowstone-gnu tests that abort in model run do not exit~~ Some yellowstone tests that abort in model run do not exit Oct 19, 2016

rljacob closed this as completed Apr 7, 2017
