mpi-serial nag case fails with pio2 when defining h1 history restart file #1030

Closed
billsacks opened this issue Jun 2, 2020 · 3 comments
Labels: bug (something is working incorrectly); priority: high (High priority to fix/merge soon, e.g., because it is a problem in important configurations)

Comments

@billsacks
Member

Brief summary of bug

The test ERS_Ly5_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianGs.izumi_nag.clm-monthly fails with pio2 while defining the h1 history restart file.

General bug information

CTSM version you are using: ctsm1.0.dev093-17-ge55f7451. I also suspect this is an issue on ctsm master if cime is switched to a tag on master (cime5.8.23) rather than pointing to the cime branch that we're currently using. That branch was created to get around this issue: cime master uses pio2 for this test, whereas the cime branch uses pio1.

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: Only observed for this one test; unknown if it is a more general issue

Details of bug

The above-referenced test dies when writing the restart files, with:

 Opened file ./ERS_Ly5_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianGs.izumi_nag.clm-monthly.20200601_170252_ohlcig.clm2.r.2003-01-01-00000.nc to write 422
 htape_create : Opening netcdf rhtape ./ERS_Ly5_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianGs.izumi_nag.clm-monthly.20200601_170252_ohlcig.clm2.rh0.2003-01-01-00000.nc
 Opened file ./ERS_Ly5_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianGs.izumi_nag.clm-monthly.20200601_170252_ohlcig.clm2.rh0.2003-01-01-00000.nc to write 423
 htape_create : Successfully defined netcdf restart history file  1
 htape_create : Opening netcdf rhtape ./ERS_Ly5_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianGs.izumi_nag.clm-monthly.20200601_170252_ohlcig.clm2.rh1.2003-01-01-00000.nc
 Opened file ./ERS_Ly5_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianGs.izumi_nag.clm-monthly.20200601_170252_ohlcig.clm2.rh1.2003-01-01-00000.nc to write 424
 htape_create : Successfully defined netcdf restart history file  2

The job output file indicates a segmentation fault.

I introduced the following diffs to get a better sense of where it's dying:

diff --git a/src/main/histFileMod.F90 b/src/main/histFileMod.F90
index 5737ebfc..4f6f2e07 100644
--- a/src/main/histFileMod.F90
+++ b/src/main/histFileMod.F90
@@ -3753,7 +3753,9 @@ contains
                         // ".rh" // hnum //"."// trim(rdate) //".nc"
 
           call htape_create( t, histrest=.true. )
-
+          write(iulog,*) 'WJS: hist_restart_ncd 1'
+          call shr_sys_flush(iulog)
+          
           ! Add read/write accumultators and counters if needed
           if (.not. tape(t)%is_endhist) then
              do f = 1,tape(t)%nflds
@@ -3815,6 +3817,9 @@ contains
              end do
           endif
 
+          write(iulog,*) 'WJS: hist_restart_ncd 2'
+          call shr_sys_flush(iulog)
+          
           !
           ! Add namelist information to each restart history tape
           !
@@ -3825,7 +3830,10 @@ contains
           call ncd_defdim( ncid_hist(t), 'max_chars'    , max_chars   , dimid)
           call ncd_defdim( ncid_hist(t), 'max_nflds'    , max_nflds   ,  dimid)   
           call ncd_defdim( ncid_hist(t), 'max_flds'     , max_flds    , dimid)   
-       
+
+          write(iulog,*) 'WJS: hist_restart_ncd 3'
+          call shr_sys_flush(iulog)
+          
           call ncd_defvar(ncid=ncid_hist(t), varname='nhtfrq', xtype=ncd_int, &
                long_name="Frequency of history writes",               &
                comment="Namelist item", &
@@ -3854,6 +3862,9 @@ contains
                long_name="Fieldnames to exclude",  &
                dim1name='fname_lenp2', dim2name='max_flds' )
 
+          write(iulog,*) 'WJS: hist_restart_ncd 4'
+          call shr_sys_flush(iulog)
+          
           call ncd_defvar(ncid=ncid_hist(t), varname='nflds', xtype=ncd_int, &
                long_name="Number of fields on file", units="unitless",        &
                dim1name='scalar')
@@ -3865,7 +3876,9 @@ contains
           call ncd_defvar(ncid=ncid_hist(t), varname='begtime', xtype=ncd_double, &
                long_name="Beginning time", units="time units",     &
                dim1name='scalar')
-   
+
+          write(iulog,*) 'WJS: hist_restart_ncd 5'          
+          call shr_sys_flush(iulog)
           call ncd_defvar(ncid=ncid_hist(t), varname='num2d', xtype=ncd_int, &
                long_name="Size of second dimension", units="unitless",     &
                dim1name='max_nflds' )
@@ -3873,6 +3886,8 @@ contains
                long_name="History pointer index", units="unitless",     &
                dim1name='max_nflds' )
 
+          write(iulog,*) 'WJS: hist_restart_ncd 6'
+          call shr_sys_flush(iulog)
           call ncd_defvar(ncid=ncid_hist(t), varname='avgflag', xtype=ncd_char, &
                long_name="Averaging flag", &
                units="A=Average, X=Maximum, M=Minimum, I=Instantaneous, SUM=Sum", &
@@ -3905,8 +3920,11 @@ contains
                long_name="landunit to gridpoint scale type", &
                dim1name='scale_type_string_length', dim2name='max_nflds' )
 
+          write(iulog,*) 'WJS: hist_restart_ncd 7'          
+          call shr_sys_flush(iulog)
           call ncd_enddef(ncid_hist(t))
-
+          write(iulog,*) 'WJS: hist_restart_ncd 8'
+          call shr_sys_flush(iulog)
        end do   ! end of ntapes loop   
 
        RETURN
diff --git a/src/main/ncdio_pio.F90.in b/src/main/ncdio_pio.F90.in
index ff7320bc..afbd0d21 100644
--- a/src/main/ncdio_pio.F90.in
+++ b/src/main/ncdio_pio.F90.in
@@ -406,8 +406,9 @@ contains
     integer :: status   ! error status
     !-----------------------------------------------------------------------
 
+    write(iulog,*) 'WJS: in ncd_enddef'
     status = PIO_enddef(ncid)
-
+    write(iulog,*) 'WJS: done ncd_enddef: ', status
   end subroutine ncd_enddef
 
   !-----------------------------------------------------------------------
diff --git a/src/main/restFileMod.F90 b/src/main/restFileMod.F90
index 4f718b92..c02541e2 100644
--- a/src/main/restFileMod.F90
+++ b/src/main/restFileMod.F90
@@ -7,6 +7,7 @@ module restFileMod
   ! !USES:
 #include "shr_assert.h"
   use shr_kind_mod     , only : r8 => shr_kind_r8
+  use shr_sys_mod    , only : shr_sys_flush
   use decompMod        , only : bounds_type, get_proc_clumps, get_clump_bounds
   use decompMod        , only : BOUNDS_LEVEL_PROC
   use spmdMod          , only : masterproc, mpicom
@@ -104,29 +105,49 @@ contains
     call clm_instRest(bounds, ncid, flag='define', &
          writing_finidat_interp_dest_file=writing_finidat_interp_dest_file)
 
-    if (present(rdate)) then 
+    if (present(rdate)) then
+       write(iulog,*) 'WJS: calling hist_restart_ncd: define'
+       call shr_sys_flush(iulog)
        call hist_restart_ncd (bounds, ncid, flag='define', rdate=rdate )
+       write(iulog,*) 'WJS: done hist_restart_ncd: define'
+       call shr_sys_flush(iulog)
     end if
 
     call restFile_enddef( ncid )
-
+    write(iulog,*) 'WJS: here 1'
+    call shr_sys_flush(iulog)
+    
     ! Write variables
     
     call timemgr_restart_io( ncid, flag='write' )
-
+    write(iulog,*) 'WJS: here 2'
+    call shr_sys_flush(iulog)
+    
     call subgridRestWrite(bounds, ncid, flag='write' )
-
+    write(iulog,*) 'WJS: here 3'
+    call shr_sys_flush(iulog)
+    
     call accumulRest( ncid, flag='write' )
-
+    write(iulog,*) 'WJS: here 4'
+    call shr_sys_flush(iulog)
+    
     call clm_instRest(bounds, ncid, flag='write', &
          writing_finidat_interp_dest_file=writing_finidat_interp_dest_file)
-
+    write(iulog,*) 'WJS: here 5'
+    call shr_sys_flush(iulog)
+    
     call hist_restart_ncd (bounds, ncid, flag='write' )
-
+    write(iulog,*) 'WJS: here 6'
+    call shr_sys_flush(iulog)
+    
     ! Close file 
     
     call restFile_close( ncid )
+    write(iulog,*) 'WJS: here 7'
+    call shr_sys_flush(iulog)
     call restFile_closeRestart( file )
+    write(iulog,*) 'WJS: here 8'
+    call shr_sys_flush(iulog)
     
     ! Write restart pointer file
     

From this I got:

 htape_create : Opening netcdf rhtape ./ERS_Ly5_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianGs.izumi_nag.clm-monthly.20200601_170252_ohlcig.clm2.rh1.2003-01-01-00000.nc
 Opened file ./ERS_Ly5_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianGs.izumi_nag.clm-monthly.20200601_170252_ohlcig.clm2.rh1.2003-01-01-00000.nc to write 424
 htape_create : Successfully defined netcdf restart history file  2
 WJS: hist_restart_ncd 1
 WJS: hist_restart_ncd 2
 WJS: hist_restart_ncd 3
 WJS: hist_restart_ncd 4
 WJS: hist_restart_ncd 5
 WJS: hist_restart_ncd 6
 WJS: hist_restart_ncd 7
 WJS: in ncd_enddef
 WJS: done ncd_enddef:  0
 WJS: hist_restart_ncd 8
 WJS: done hist_restart_ncd: define
 WJS: in ncd_enddef
 WJS: done ncd_enddef:  0
 WJS: here 1
 WJS: here 2
 WJS: here 3
 WJS: here 4
 WJS: here 5

The last marker printed is "WJS: here 5" and "WJS: here 6" never appears, implying that the death occurs somewhere in:

call hist_restart_ncd (bounds, ncid, flag='write' )
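
The repeated write/flush pairs in the diff above could be folded into one small helper. The following is a minimal, self-contained sketch rather than CTSM code: the names debug_checkpoint_mod, debug_checkpoint and checkpoint_demo are hypothetical, and the standard Fortran 2003 flush statement stands in for shr_sys_flush so that the example compiles on its own. Flushing after every marker matters because buffered output can be lost when the process dies with a segmentation fault, which would make the last marker in the log misleading.

```fortran
! Sketch only: these names are hypothetical and not part of CTSM.
! The Fortran 2003 flush statement is used here in place of shr_sys_flush.
module debug_checkpoint_mod
  implicit none
  private
  public :: debug_checkpoint
contains

  ! Write a labelled marker to the given unit and flush it immediately,
  ! so the marker reaches the log even if the program crashes right after.
  subroutine debug_checkpoint(unit, label)
    integer, intent(in) :: unit
    character(len=*), intent(in) :: label

    write(unit, *) 'WJS: ', label
    flush(unit)
  end subroutine debug_checkpoint

end module debug_checkpoint_mod

program checkpoint_demo
  use, intrinsic :: iso_fortran_env, only : output_unit
  use debug_checkpoint_mod, only : debug_checkpoint
  implicit none

  call debug_checkpoint(output_unit, 'hist_restart_ncd 1')
  ! ... the code under suspicion would run here ...
  call debug_checkpoint(output_unit, 'hist_restart_ncd 2')
end program checkpoint_demo
```

In the model itself the unit would be iulog rather than output_unit. If the code between two markers crashes, the first marker is the last one in the log, which is the same bracketing logic used above.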

I tried running a similar case in DEBUG mode. It died in the same place, i.e., after "htape_create : Successfully defined netcdf restart history file 2". That case didn't have the extra write statements from the above diff, so I'm not sure exactly where it died, but the output upon death was the same as in the non-debug case prior to adding the additional write statements. DEBUG mode didn't give any additional useful information.

@billsacks added the "priority: high" and "bug" labels Jun 2, 2020
@billsacks
Member Author

I'm labeling this high priority because cime is now using pio2 for mpi-serial cases, so for us to point to cime master, we'll need to either resolve this issue or treat it as an expected failure.

@ekluzek added this to the cesm2.2.0 milestone Jun 2, 2020
@ekluzek self-assigned this Jun 2, 2020
@billsacks
Member Author

PIO1 will be the default for CESM2.2, but moving to PIO2 is a high priority for CESM2.3. So I'm keeping this as high priority, but removing the cesm2.2.0 milestone.

@billsacks
Member Author

This is fixed by some combination of #1095 (which came to master in ctsm1.0.dev110) and a cime update that will be coming in soon. I don't remember which of these changes (the ctsm changes, the cime changes, or both) actually fixes this issue, but since the cime changes are coming in imminently, I'm going ahead and closing this.
