A_WCYCL2000 ne120_oRRS15 failing during ice initialization #856

Closed
worleyph opened this issue Apr 21, 2016 · 33 comments

@worleyph
Contributor

This is premature, but it is the same type of error that we have seen before. Would like to confirm that this is not something already known and being worked on.

 -compset A_WCYCL2000 -res ne120_oRRS15 -mach titan -compiler pgi_acc -camse_target preqx_acc

so on Titan. It is dying, without an error message. Tail of cpl.log:

 (seq_mct_drv) : Initialize each component: atm, lnd, rof, ocn, ice, glc, wav
 (component_init_cc:mct) : Initialize component atm
 (component_init_cc:mct) : Initialize component lnd
 (component_init_cc:mct) : Initialize component rof
 (component_init_cc:mct) : Initialize component ocn
 (component_init_cc:mct) : Initialize component ice

ice.log is empty. Tail of cesm.log:

  ----- done parsing run-time I/O from streams.ocean -----

 [NID 10122] 2016-04-21 05:52:24 Apid 10995029: initiated application termination

Here NID 10122 is where processes 0, 1, 2, and 3 are located. This failed twice, both times with a 21600x4 atmosphere (and all other components, though 21600x1 for mpas-cice and mpas-o). The two runs used

 <entry id="PIO_STRIDE"   value="24"  />

and

 <entry id="PIO_STRIDE"   value="32"  />

I'm trying again, but with 5400x4 this time, and with

 <entry id="PIO_STRIDE"   value="-1"  />

Job is in the queue.

Any suggestions? If the next job fails (in the same way), I'll look into putting CICE on its own compute nodes to help isolate what is going on.

Adding @jayeshkrishna, @mt5555, @golaz, and @amametjanov to the conversation.

@amametjanov
Member

amametjanov commented Apr 21, 2016

Yesterday, the inputdata file seaice.RRS.15to5km.151209.nc was missing from the repo. Is it in Titan's inputdata directory under ice/mpas-cice/oRRS15to5?

@worleyph
Contributor Author

I added everything necessary to run the case (for the two partition sizes that I have tried so far - will need to generate more for ocean and cice if you want to try something different).

Note that if you are going to reproduce this, land initialization (I think) takes forever (20 minutes), so it will be a while before you get to the problem area. Once we get it working, there may be a new issue about the initialization cost :-).

@mt5555
Contributor

mt5555 commented Apr 21, 2016

With 21600 tasks and PIO_STRIDE=32, that's way too many I/O processors (my opinion).

I think you need to keep the number of I/O tasks as low as possible without running out of memory, which in my experience is around 64 or 128.

This is why I always argue against using PIO_STRIDE and instead suggest people work with PIO_NUMTASKS, since that's the important number: once we find a good value, it should be kept fixed independent of the number of tasks used to run the model.
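
For example, in env_run.xml (same entry format as above; the value here is only illustrative, and the right number is machine- and case-dependent):

 <entry id="PIO_NUMTASKS"   value="128"  />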

@worleyph
Contributor Author

@mt5555 , while I agree that performance could be improved with better choices, I had not found that jobs would die from this until recently, and primarily on Titan. My own experiments have not matched your experience for quite a while, and I have found that playing with the PIO message-passing options was often just as effective at improving things. We have large-scale communication going on throughout the model - no reason that PIO should be so sensitive to this.

@mt5555
Contributor

mt5555 commented Apr 21, 2016

Probably not relevant, but back in the ACME v0.1 days, land initialization was very slow because of an O(nthreads^2) procedure. So if land performance is not an issue, running it with 1 thread would reduce the initialization cost.

Also, running land on its own set of processors allows the land initialization to be done in parallel with the other components' initialization, which we also used to reduce initialization costs.

@mt5555
Contributor

mt5555 commented Apr 21, 2016

@worleyph : yes - all my information is from experience with ACME v0, from 2 years ago, and I agree it may not be relevant here.

@jonbob
Contributor

jonbob commented Apr 21, 2016

@worleyph There is a problem with cice initialization using pgi. I'm trying to track it down, but it's one of those awful bugs that works fine in debug mode, as well as using the intel compiler. So it's specific to pgi.

@amametjanov
Member

A run of -compset A_WCYCL2000 -res ne120_oRRS15 -mach mira failed during ocn initialization:

  • casedir: /home/azamatm/repos/ACME-integration/cime/scripts/cases/AWCYCL2000-ne120oRRS15-tune01
------------------------------------------------------------------------
Program   : /projects/HiRes_EarthSys/azamatm/AWCYCL2000-ne120oRRS15-tune01/bld/cesm.exe
------------------------------------------------------------------------
+++ID Rank: 7200, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN 

00000000046c0bd4
00004e96.long_branch_r2off.__wrap_fclose+0
:0

0000000003c4108c
piodie
/gpfs/mira-home/azamatm/repos/ACME-integration/cime/externals/pio/pio/pio_support.F90:102

0000000003cca80c
alloc_check_1d_long
/gpfs/mira-home/azamatm/repos/ACME-integration/cime/externals/pio/pio/alloc_mod.F90.in:96

0000000003d0c1cc
compute_counts
/gpfs/mira-home/azamatm/repos/ACME-integration/cime/externals/pio/pio/box_rearrange.F90.in:1381

0000000003d0e740
box_rearrange_create
/gpfs/mira-home/azamatm/repos/ACME-integration/cime/externals/pio/pio/box_rearrange.F90.in:1079

0000000003ccae64
rearrange_create_box_
/gpfs/mira-home/azamatm/repos/ACME-integration/cime/externals/pio/pio/rearrange.F90.in:165

0000000003c3756c
pio_initdecomp_dof_i8
/gpfs/mira-home/azamatm/repos/ACME-integration/cime/externals/pio/pio/piolib_mod.F90:1186

000000000294edcc
mpas_io_set_var_indices
/gpfs/mira-fs1/projects/HiRes_EarthSys/azamatm/AWCYCL2000-ne120oRRS15-tune01/bld/ice/source/framework/mpas_io.f90:1299

0000000002caf0a4
mpas_streamaddfield_generic
/gpfs/mira-fs1/projects/HiRes_EarthSys/azamatm/AWCYCL2000-ne120oRRS15-tune01/bld/ice/source/framework/mpas_io_streams.f90:2311

0000000002cb4f28
mpas_streamaddfield_2dreal
/gpfs/mira-fs1/projects/HiRes_EarthSys/azamatm/AWCYCL2000-ne120oRRS15-tune01/bld/ice/source/framework/mpas_io_streams.f90:1351

0000000002988810
build_stream
/gpfs/mira-fs1/projects/HiRes_EarthSys/azamatm/AWCYCL2000-ne120oRRS15-tune01/bld/ice/source/framework/mpas_stream_manager.f90:3643

000000000298b24c
read_stream
/gpfs/mira-fs1/projects/HiRes_EarthSys/azamatm/AWCYCL2000-ne120oRRS15-tune01/bld/ice/source/framework/mpas_stream_manager.f90:3037

000000000298da0c
mpas_stream_mgr_read
/gpfs/mira-fs1/projects/HiRes_EarthSys/azamatm/AWCYCL2000-ne120oRRS15-tune01/bld/ice/source/framework/mpas_stream_manager.f90:2873

0000000003750a90
ocn_forward_mode_init
/gpfs/mira-fs1/projects/HiRes_EarthSys/azamatm/AWCYCL2000-ne120oRRS15-tune01/bld/ocn/source/core_ocean/mode_forward/mpas_ocn_forward_mode.f90:159

000000000388e6d4
ocn_core_init
/gpfs/mira-fs1/projects/HiRes_EarthSys/azamatm/AWCYCL2000-ne120oRRS15-tune01/bld/ocn/source/core_ocean/driver/mpas_ocn_core.f90:79

000000000355ec18
ocn_init_mct
/gpfs/mira-fs1/projects/HiRes_EarthSys/azamatm/AWCYCL2000-ne120oRRS15-tune01/bld/ocn/source/ocean_cesm_driver/ocn_comp_mct.f90:475

000000000103277c
component_init_cc
/gpfs/mira-home/azamatm/repos/ACME-integration/cime/driver_cpl/driver/component_mod.F90:230

0000000001014e64
cesm_init
/gpfs/mira-home/azamatm/repos/ACME-integration/cime/driver_cpl/driver/cesm_comp_mod.F90:1163

0000000001026324
cesm_driver
/gpfs/mira-home/azamatm/repos/ACME-integration/cime/driver_cpl/driver/cesm_driver.F90:102

0000000004d11ec8
generic_start_main
/bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226

0000000004d121c4
__libc_start_main
/bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194

0000000000000000
??
??:0

@worleyph
Contributor Author

@amametjanov , your Mira issue looks like a "deliberate" abort due to a failed allocate, so is it a memory problem?

@worleyph
Contributor Author

@jonbob , I run A_WCYCL2000 with ne30_oEC all of the time with PGI (or at least I have within the past month or so). So, this problem does not occur all of the time.

@jonbob
Contributor

jonbob commented Apr 21, 2016

@worleyph , I know -- it's a new problem.

@worleyph
Contributor Author

worleyph commented Apr 21, 2016

Update: @jonbob and I have exchanged a number of direct e-mails to understand where we are both coming from. @jonbob sees this all of the time, but using a_B1850CN.

I moved to ne30_oEC and the problem disappeared. I was able to get mpas-o initialization to die by setting

 <entry id="PIO_STRIDE"   value="32"  />

which is the same issue that @jayeshkrishna already identified?

@jonbob
Contributor

jonbob commented Apr 21, 2016

Pat,

I’m having trouble getting the PIO settings to stick - I set them in env_run.xml, but they don’t seem to get used and are back to the original settings after I submit. Is there a secret?

Thanks,
Jon

@worleyph
Contributor Author

How can you tell that they don't stick? Can you give me an example of what you are trying to do?

@jonbob
Contributor

jonbob commented Apr 21, 2016

I was just trying to do simple things, like change the stride. But I think it has something to do with env_run.orig, which one of the scripts copies over env_run.xml. I'm testing that right now...

@jayeshkrishna
Contributor

FYI: #760 also deals with a crash involving PIO + certain PIO strides + Titan + OCN initialization (the bug seems to be inside the MPI library).

@jayeshkrishna
Contributor

jayeshkrishna commented Apr 21, 2016

I am also able to recreate this problem on Titan with the following configuration (one of the cases Pat mentioned above),

  • ./create_newcase -compset A_WCYCL2000 -res ne120_oRRS15 -mach titan -compiler pgi_acc -camse_target preqx_acc
  • ATM/LND/CPL/ROF : 5400x4 (NTASKS x NTHRDS)
  • ICE/OCN : 5400x1
  • GLC/WAV : 1x4

@worleyph
Contributor Author

A question for someone in the CICE development group ( @jonbob , can you pass this along?). Since someone ( @jayeshkrishna ? @jonbob ?) thought that this might be an issue with the namelist read, I compared the ne120_oRRS15 mpas-cice_in with that of ne30_oEC mpas-cice_in (which is working for me), and the only difference is

  config_block_decomp_file_prefix = '/lustre/atlas1/cli900/world-shared/cesm/inputdata/ice/mpas-cice/oRRS15to5/mpas-cice.graph.info.151209.part.'

vs.

 config_block_decomp_file_prefix = '/lustre/atlas1/cli900/world-shared/cesm/inputdata/ice/mpas-cice/oEC60to30/mpas-cice.graph.info.151020.part.'

I'm actually surprised that this is the only difference, for such a big difference in grid resolution. Are there no resolution-specific runtime parameters?

One minor question: there is a mix of 'true' and '.true.', and of 'false' and '.false.', throughout the namelist. It seems like it would be better to stick with one style or the other. I'm assuming that they are treated the same. If not, then this is an error waiting to happen.
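
For what it's worth, a standard Fortran namelist read accepts both spellings, so they should be treated the same if these values are read with a plain namelist READ. A minimal check (hypothetical namelist group and file name, purely for illustration):

 program check_nml_logicals
    implicit none
    logical :: config_a, config_b
    namelist /test_nml/ config_a, config_b

    ! test_nml.in contains:
    !   &test_nml
    !     config_a = true
    !     config_b = .false.
    !   /
    open(10, file='test_nml.in', status='old')
    read(10, nml=test_nml)
    close(10)
    write(*,*) config_a, config_b   ! both spellings are parsed as logicals
 end program check_nml_logicals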

@worleyph
Contributor Author

Latest: I was able to get the error to disappear by commenting out the logic:

     if ( iam /= 0 ) then
        open(stdout_shr, file='/dev/null', position='APPEND')
     else
        ...
     endif

in both mpas-o and mpas-cice (in ocn_comp_mct.F and ice_comp_mct.F). I am now trying with this commented out only in mpas-cice. Makes no sense to me, but making progress?

@worleyph
Contributor Author

Worked with only commenting out redirection to /dev/null in mpas-cice.

@worleyph
Contributor Author

And also worked with only commenting out redirection to /dev/null in mpas-ocean.

So it appears to be a function of the number of logical units being assigned to /dev/null?

@jonbob
Contributor

jonbob commented Apr 27, 2016

@worleyph - I'm impressed that you could find that! And what a weird bug!

@worleyph
Contributor Author

Not sure of the diagnosis. If this is it, could probably generate a small reproducer and submit it to PGI. I'll decide tomorrow.

@douglasjacobsen
Member

I can probably give you a reasonable fix for this in the meantime that would also make sense on other machines (possibly).

I'll look into it as soon as I get some time.

@worleyph
Contributor Author

From a 2005 comment (http://www.pgroup.com/userforum/viewtopic.php?t=146&sid=1b8e34ae14864dcfdc287ba66b0fc8a3), PGI did not support opening /dev/null twice at that time.

Just tried the same test program:

 program test

 open(1,file='/dev/null')
 open(2,file='/dev/null')

 end program

and got

 PGFIO-F-207/OPEN/unit=2/file is already connected to another unit.

If this is the source, then I have no idea why this only happens for ne120_oRRS15 and not ne30_oEC.

So, we can't run on Titan without doing something. I can comment out the redirection to /dev/null in mpas-o or mpas-cice in the meantime ( @douglasjacobsen , do you have a preference?). @rljacob , what is your opinion?
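
In the meantime, one possible guard (just a sketch, not necessarily the fix we want) would be to INQUIRE whether /dev/null is already connected before opening it a second time:

 program devnull_guard
    implicit none
    integer :: stdout_shr = 711        ! hypothetical unit number, for illustration
    logical :: already_open
    integer :: existing_unit

    ! PGI rejects a second OPEN on /dev/null (PGFIO-F-207 above), so only open
    ! it if it is not already connected, and otherwise reuse the existing unit.
    inquire(file='/dev/null', opened=already_open, number=existing_unit)
    if (already_open) then
       stdout_shr = existing_unit
    else
       open(stdout_shr, file='/dev/null', position='APPEND')
    end if

    write(stdout_shr,*) 'this output is discarded'
 end program devnull_guard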

@douglasjacobsen
Member

@worleyph Either is fine with me. You could even do both just to be safe.

I'll look into a more permanent solution in the meantime.

Does removing the redirection cause a flood of messages to be written to the cesm log?

@worleyph
Contributor Author

Does removing the redirection cause a flood of messages to be written to the cesm log?

Don't know - testing now.

However, I just realized that we have another workaround. The reason that the ne30_oEC runs were not failing is that I had the ocean running on different nodes than the ice. These latest ne120_oRRS15 runs were stacked, so that the same processes were running both ocean and ice.

I'll restart benchmarking but with ocean on its own nodes. It would be nice to get this "fixed" (if that is the right word) so that this is not a requirement, but running ocean on its own nodes will be a production configuration in any case.
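
For reference, this is the kind of change I mean in env_mach_pes.xml (values purely illustrative; the point is just to give OCN a ROOTPE beyond the tasks used by the other components so that it lands on its own nodes):

 <entry id="NTASKS_OCN"   value="5400"  />
 <entry id="ROOTPE_OCN"   value="5400"  />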

@douglasjacobsen
Member

Yeah, I'm thinking a good solution would be to redirect to something like /tmp/ice.log.{procID} instead of /dev/null. Then they would each have their own file.
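
Roughly along these lines (a hypothetical sketch reusing the iam and stdout_shr names from the snippet Pat quoted, not the actual change):

 character(len=64) :: logname

 ! Give each non-root task its own log file instead of /dev/null, so the
 ! same file is never opened on two different units in one executable.
 if ( iam /= 0 ) then
    write(logname, '(a,i0)') '/tmp/ice.log.', iam
    open(stdout_shr, file=trim(logname), position='APPEND')
 endif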

@douglasjacobsen
Member

@worleyph I just assigned this to myself, to help me remember that you need a fix from me for this.

@douglasjacobsen
Member

@worleyph My hope is that #875 fixes this issue. If you could test it on Titan, please let me know whether it fixes the problem for you.

@worleyph
Contributor Author

@douglasjacobsen , I'm concerned that /tmp may be too small - I'm not even sure where it exists for the compute nodes on the systems at ALCF, NERSC, and OLCF. I would need to check this out. I would hate for jobs to abort simply because /tmp is not cleaned out between system PMs.

Just looked at one of the compute nodes:

 > aprun -n 1 df /tmp
 Filesystem     1K-blocks  Used Available Use% Mounted on
 rwtmp           16541884   160  16541724   1% /tmp

and all of the contents seem to be dated as of the time I was allocated the node, so perhaps a sweep occurs with every new job. Still don't know where it physically resides.

In any case, I'll give it a try on Titan. Someone else may want to do so on the other systems. I'll also look at performance compared to /dev/null (when ice and ocean are not overlapped).

Thanks.

@jonbob
Contributor

jonbob commented May 27, 2016

@worleyph : I did get the A_WCYCL2000 ne120_oRRS15 to run last night on edison, using both the intel and gnu compilers. My tests were under debug mode and only ran a limited number of timesteps, but all components did initialize and run successfully. I'll try today in optimized mode, and work to get the necessary model configuration changes into the scripts. I think this means the mapping files are all OK -- with the exception of the runoff map, which has a known issue in magnitude.

@worleyph
Contributor Author

worleyph commented Jun 9, 2016

PR #875 fixed the problem. Closing.
