Some yellowstone tests that abort in model run do not exit #383

Closed
billsacks opened this issue Aug 11, 2016 · 13 comments

Comments

@billsacks
Member

billsacks commented Aug 11, 2016

In doing some testing of my new test infrastructure, I introduced a call to endrun in CLM to make sure that an abort is reported correctly. The behavior was not what I expected. I have confirmed that the same behavior exists at the head of master (5df46a2).

Specifically: At least on yellowstone, when the model aborts, the job does not exit from the machine, but instead remains running until it hits its wallclock limit. In addition, the final status is reported as:

PEND LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop RUN

whereas I would have expected:

FAIL LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop RUN

I tested this with LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop with the following code modifications: Either

--- ../../components/clm/src/main/controlMod.F90    2016-08-03 08:10:14.818918163 -0600
+++ LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop.20160811_150527/SourceMods/src.clm/controlMod.F90  2016-08-11 09:07:43.687826068 -0600
@@ -297,6 +297,8 @@

        if (use_init_interp) then
           call apply_use_init_interp(finidat, finidat_interp_source)
+       else
+          call endrun(msg='killing run with init_interp')
        end if

        ! History and restart files

(which forces an abort in the first run of the LII test)

or

--- ../../components/clm/src/main/controlMod.F90    2016-08-03 08:10:14.818918163 -0600
+++ LII_D.f10_f10.ICLM45BGCCROP.yellowstone_gnu.clm-crop.20160811_150510/SourceMods/src.clm/controlMod.F90  2016-08-11 09:06:53.838141627 -0600
@@ -297,6 +297,7 @@

        if (use_init_interp) then
           call apply_use_init_interp(finidat, finidat_interp_source)
+          call endrun(msg='killing run with init_interp')
        end if

        ! History and restart files

(which forces an abort in the second run of the LII test).
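
The PEND versus FAIL distinction above can be sketched as follows. This is a minimal illustration, not CIME's actual implementation: the function name `run_and_classify` and its `timeout_s` parameter are hypothetical, and the status strings are borrowed from the test output quoted above. The idea is that a run that aborts, or that hangs past a deadline, ends up classified as FAIL instead of being left in a pending state.

```python
import subprocess
import sys


def run_and_classify(cmd, timeout_s):
    """Run cmd; return 'PASS' on a zero exit code, 'FAIL' on a nonzero
    exit code or if the command is still running after timeout_s seconds.
    (Hypothetical helper, not part of CIME.)"""
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child and waits for it before
        # re-raising, so a hung run is reported instead of left pending.
        return "FAIL"
    return "PASS" if result.returncode == 0 else "FAIL"


if __name__ == "__main__":
    # An aborting "model run": exits immediately with a nonzero status.
    print(run_and_classify([sys.executable, "-c", "import sys; sys.exit(2)"], 30))
```

Note that `subprocess.run` only kills the process it started directly. A launcher that leaves its own child tasks running would need extra handling, such as killing the whole process group.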

@mvertens
Contributor

I noticed the same behavior recently - but thought it was a system problem.


@jedwards4b
Contributor

I tried to reproduce this using SMS_D with a shr_sys_abort call introduced in cesm_comp_mod.F90. That run exited correctly and did not wait for the wallclock limit.

@billsacks
Member Author

@jedwards4b : If I'm looking in the right place, it looks like your run died with:

Exception during run: ERROR: Command: 'TARGET_PROCESSOR_LIST=AUTO_SELECT mpirun.lsf /glade/p/cesmdata/cseg/tools/bin/launch   /glade/scratch/jedwards/SMS_D.f19_g16.X.yellowstone_intel.20160811_153737/bld/cesm.exe  >> cesm.log.$LID 2>&1 ' failed with error '' from dir '/glade/scratch/jedwards/SMS_D.f19_g16.X.yellowstone_intel.20160811_153737/run'

I did one ERS test last night from (I believe) cime5.0.7, and it also exited promptly, with that same error message. No other runs that I have done recently have had that error message, and no other runs I have done recently have exited promptly. I'm not sure if there's a connection.

@jedwards4b
Contributor

Further testing seems to indicate that this problem is related to the gnu compiler on yellowstone.

@billsacks
Member Author

@jedwards4b : Thank you very much for your investigation of this. I'm fine with this issue being closed, then, if you are.

The exception error that I referenced above still seems weird, but that's probably a different issue.

@gold2718

I have seen this behavior using Intel. It seems to happen when only a subset of tasks calls endrun. I have had jobs crash after less than a minute but chew up the whole wallclock allotment (the danger of firing off jobs just before going to bed). I opened an issue with CISL who said:

In this particular case some of the tasks crashed due to code error, I do see "59:forrtl: severe (151): allocatable array is already allocated". The problem is, IBM PE won't exit if some of the tasks exit. This is a legal behavior of their launcher (think of mpi_spawn and exit).
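
The launcher behavior described in that reply can be illustrated with a small simulation. This is only a sketch and does not reflect how IBM PE works internally: plain child processes stand in for MPI tasks, and `launch_tasks` and `run_fail_fast` are hypothetical names. One task "crashes" right away while the others keep running. A launcher that simply waited on every task would sit there until the survivors finished (or until the wallclock limit), whereas the fail-fast loop below tears the job down as soon as any task exits with a nonzero code.

```python
import subprocess
import sys
import time

# Hypothetical stand-ins for MPI tasks: one crashes at once, the rest keep working.
CRASHING_TASK = "import sys; sys.exit(1)"
HEALTHY_TASK = "import time; time.sleep(30)"


def launch_tasks(n_tasks):
    """Start n_tasks child processes; task 0 crashes, the others keep running."""
    tasks = []
    for rank in range(n_tasks):
        body = CRASHING_TASK if rank == 0 else HEALTHY_TASK
        tasks.append(subprocess.Popen([sys.executable, "-c", body]))
    return tasks


def run_fail_fast(n_tasks, poll_interval=0.05):
    """Return the first nonzero exit code seen, killing the surviving tasks.
    Return 0 only if every task exits cleanly."""
    tasks = launch_tasks(n_tasks)
    try:
        while True:
            codes = [t.poll() for t in tasks]  # None means still running
            failed = [c for c in codes if c not in (None, 0)]
            if failed:
                return failed[0]
            if all(c == 0 for c in codes):
                return 0
            time.sleep(poll_interval)
    finally:
        # Tear down anything still running so the job does not linger.
        for t in tasks:
            if t.poll() is None:
                t.kill()
            t.wait()


if __name__ == "__main__":
    print("job exit code:", run_fail_fast(3))
```

Replacing the polling loop with a plain `t.wait()` on each task reproduces the hang: the crashed task is reaped immediately, but the launcher stays alive until the remaining tasks finish.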

@jedwards4b
Contributor

@gold2718 I don't think that it's as simple as that. The test that I did above that hung with gnu but not intel was aborting on all tasks. I tried the following simple code with gnu and it does not hang:

program testgnuabort
  use mpi
  implicit none
  intrinsic :: backtrace   ! GNU Fortran extension
  integer :: ierr
  integer :: myrank, color, key, mycomm

  call mpi_init(ierr)

  call mpi_comm_rank(MPI_COMM_WORLD, myrank, ierr)

  ! Split MPI_COMM_WORLD into even-rank (color 0) and odd-rank (color 1) communicators
  color = mod(myrank, 2)
  key = myrank/2
  call mpi_comm_split(MPI_COMM_WORLD, color, key, mycomm, ierr)

  ! Only the even ranks abort; the odd ranks go straight to mpi_finalize
  if (color == 0) then
     call backtrace()

     call mpi_abort(mycomm, -1, ierr)

     call abort()
  endif

  call mpi_finalize(ierr)

end program testgnuabort

@billsacks changed the title from "Tests that abort in model run do not exit, and are reported as PEND" to "yellowstone-gnu tests that abort in model run do not exit, and are reported as PEND" on Aug 11, 2016

@gold2718

@jedwards4b I agree that it is not simple in that different issues happen at different times with different compilers. In addition, the machine's behavior is not consistent (the same executable will sometimes exit and sometimes hang). What I was trying to say in my last message is that it seems to me (i.e., personal anecdotal evidence) that Intel jobs are more likely to hang when not all PEs call endrun.
The fact that jobs can hang even when all jobs call endrun (or abort as in your test job) is disturbing. Have you brought that up with CISL?

@jedwards4b
Contributor

I did bring it up with CISL, but I couldn't reproduce the problem consistently.

@billsacks
Member Author

Renaming from "yellowstone-gnu tests that abort in model run do not exit, and are reported as PEND" to simply "yellowstone-gnu tests that abort in model run do not exit"; the reporting as PEND has been moved to a new issue: #610

@billsacks changed the title from "yellowstone-gnu tests that abort in model run do not exit, and are reported as PEND" to "yellowstone-gnu tests that abort in model run do not exit" on Sep 29, 2016

@billsacks removed the "ty: Bug" label on Oct 19, 2016

@billsacks
Member Author

I just observed this for an intel test: ERP_D_Ld5.f09_g16.ICLM45VIC.yellowstone_intel.clm-vrtlay_interp.GC.1018-1324.45.i

I'm renaming this issue accordingly. (Actually, I guess that's what Steve was saying all along.)

@billsacks changed the title from "yellowstone-gnu tests that abort in model run do not exit" to "Some yellowstone tests that abort in model run do not exit" on Oct 19, 2016

@rljacob
Member

rljacob commented Apr 7, 2017

Won't fix since yellowstone is almost retired.

@rljacob closed this as completed on Apr 7, 2017

@ekluzek
Contributor

ekluzek commented Jun 7, 2017

OK, I saw this same problem on yellowstone_gnu for LII cases. The test runs two cases, and the first aborts with an error, but the job remains in the queue. We decided to close this issue since it seemed to be specific to yellowstone_gnu, so I'll ignore those failures for now. But I'm noting here that this continues to be a problem: LII_D_Ld3.f09_g17.ICLM45BGC.yellowstone_gnu.clm-defaultf09IC with cime5.3.0-alpha.21.
