A_WCYCL2000 ne120_oRRS15 failing during ice initialization #856
Yesterday, the inputdata file seaice.RRS.15to5km.151209.nc was missing from the repo. Is it in Titan's inputdata directory under ice/mpas-cice/oRRS15to5?
I added everything necessary to run the case (for the two partition sizes that I have tried so far - I will need to generate more for ocean and cice if you want to try something different). Note that if you are going to reproduce this, land initialization (I think) takes forever (~20 minutes), so it will be a while before you get to the problem area. Once we get it working, there may be a new issue about the initialization cost :-).
With 21600 tasks and PIO_STRIDE=32, that's way too many I/O processors (my opinion). I think you need to keep the number of I/O tasks as low as possible without running out of memory - something like 64 or 128. This is why I always argue against using PIO_STRIDE and instead suggest people work with PIO_NUMTASKS, since that's the important number: once we find a good value, it should be kept fixed independent of the number of tasks used to run the model.
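A minimal sketch of that approach, assuming a CIME/CESM-style case directory (older ACME script versions use the `./xmlchange -file env_run.xml -id ... -val ...` form); the value 128 is only an example:

```sh
# Pin the I/O task count directly instead of the stride, so it stays the
# same no matter how many MPI tasks the model is run on.
./xmlchange PIO_NUMTASKS=128
./xmlquery PIO_NUMTASKS
./xmlquery PIO_STRIDE      # confirm what will actually be used
```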
@mt5555 , while I agree that performance could be improved with better choices, I have not found that jobs die from this until recently, and primarily on Titan. In general, my own experiments have not matched your experience for quite a while, and I have found that playing with the PIO message-passing options is often just as effective at improving things. We have large-scale communication going on throughout the model - there is no reason that PIO should be so sensitive to this.
Probably not relevant, but back in the ACME v0.1 days land initialization was very slow because of an O(nthreads^2) procedure. So if land performance is not an issue, run it with 1 thread and that will reduce the initialization cost. Also, running land on its own set of processors allows the land initialization to be done in parallel with the other components' initialization, which we also used to reduce initialization costs.
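A hedged sketch of that layout using the standard CESM/CIME PE-layout variables (the task count and root PE below are placeholders, not recommendations):

```sh
# Run the land model unthreaded to avoid the O(nthreads^2) init cost, and
# give it its own processor range so its initialization can overlap with
# the other components' initialization.
./xmlchange NTHRDS_LND=1
./xmlchange NTASKS_LND=1024      # placeholder count
./xmlchange ROOTPE_LND=21600     # placeholder: first rank after the ATM block
```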
@worleyph : yes - all my information is from experience with ACME v0, from 2 years ago, and I agree it may not be relevant here.
@worleyph There is a problem with CICE initialization using PGI. I'm trying to track it down, but it's one of those awful bugs that works fine in debug mode, as well as with the Intel compiler. So it's specific to PGI.
A run of
@amametjanov , your Mira issue looks like a "deliberate" abort due to a failed allocate, so it is a memory problem?
@jonbob , I run A_WCYCL2000 with ne30_oEC all of the time with PGI (or at least I have within the past month or so). So, this problem does not occur all of the time.
@worleyph , I know -- it's a new problem.
Update: @jonbob and I have exchanged a number of direct e-mails to understand where we are both coming from. @jonbob sees this all of the time, but using a_B1850CN. I moved to ne30_oEC and the problem disappeared. I was able to get mpas-o initialization to die by setting
which is the same issue that @jayeshkrishna already identified?
Pat, I'm having trouble getting the PIO settings to stick - I set them in env_run.xml, but they don't seem to get used and are back to the original settings after I submit. Is there a secret? Thanks.
How can you tell that they don't stick? Can you give me an example of what you are trying to do?
I was just trying to do simple things, like change the stride. But I think it has something to do with env_run.orig, which one of the scripts copies over env_run.xml. I'm testing that right now...
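One hedged way to check whether hand edits survive, assuming env_run.orig is the saved copy being restored as suggested above:

```sh
# Compare the hand-edited value against the saved copy the scripts restore,
# and against the value the case will actually use.
grep PIO_STRIDE env_run.xml env_run.orig
./xmlquery PIO_STRIDE    # CIME-style query of the effective value
```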
FYI: #760 also deals with a crash in PIO + certain PIO strides + Titan + OCN initialization (the bug seems to be inside the MPI library).
I am also able to recreate this problem on Titan with the following configuration (one of the cases Pat mentioned above),
A question for someone in the CICE development group ( @jonbob , can you pass this along?). Since someone ( @jayeshkrishna ? @jonbob ?) thought that this might be an issue with the namelist read, I compared the ne120_oRRS15 mpas-cice_in with that of ne30_oEC mpas-cice_in (which is working for me), and the only difference is
vs.
I'm actually surprised that this is the only difference, given such a big difference in grid resolution. Are there no resolution-specific runtime parameters? One minor question: there is a mix of 'true' and '.true.', and of 'false' and '.false.', throughout the namelist. It seems like it would be better to stick with one style or the other. I'm assuming that they are treated the same. If not, then this is an error waiting to happen.
Latest: I was able to get the error to disappear by commenting out the logic:
in both mpas-o and mpas-cice (in ocn_comp_mct.F and ice_comp_mct.F). I am now trying with this commented out only in mpas-cice. Makes no sense to me, but making progress?
Worked with only commenting out the redirection to /dev/null in mpas-cice.
And it also worked with only commenting out the redirection to /dev/null in mpas-ocean. So it appears to be a function of the number of logical units being assigned to /dev/null?
@worleyph - I'm impressed that you could find that! And what a weird bug!
Not sure of the diagnosis. If this is it, I could probably generate a small reproducer and submit it to PGI. I'll decide tomorrow.
I can probably give you a reasonable fix for this in the meantime that would also make sense on other machines (possibly). I'll look into it as soon as I get some time.
From a 2005 comment (http://www.pgroup.com/userforum/viewtopic.php?t=146&sid=1b8e34ae14864dcfdc287ba66b0fc8a3), PGI did not support opening /dev/null twice at that time. I just tried the same test program:
and got
If this is the source, then I have no idea why this only happens for ne120_oRRS15 and not ne30_oEC. So, we can't run on Titan without doing something. I can comment out the redirection to /dev/null in mpas-o or mpas-cice in the meantime ( @douglasjacobsen , do you have a preference?). @rljacob , what is your opinion?
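For reference, a minimal hypothetical reproducer along the lines of the test described above (the unit numbers and iostat handling are assumptions, not the actual program). Fortran has traditionally not guaranteed that a file can be connected to more than one unit at the same time, so a strict runtime can legitimately reject the second open, which would explain the compiler-to-compiler difference:

```fortran
! Sketch only: open /dev/null on two different logical units and report
! the iostat of each open. A strict runtime may refuse the second open.
program devnull_twice
   implicit none
   integer :: ios1, ios2

   open(unit=11, file='/dev/null', status='unknown', iostat=ios1)
   open(unit=12, file='/dev/null', status='unknown', iostat=ios2)

   print *, 'first open  iostat = ', ios1
   print *, 'second open iostat = ', ios2

   close(11)
   close(12)
end program devnull_twice
```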
@worleyph Either is fine with me. You could even do both just to be safe. I'll look into a more permanent solution in the meantime. Does removing the redirection cause a flood of messages to be written to the cesm log?
Don't know - testing now. However, I just realized that we have another workaround. The reason that the ne30_oEC runs were not failing is that I had the ocean running on different nodes than the ice. These latest ne120_oRRS15 runs were stacked, so the same processes were running both ocean and ice. I'll restart benchmarking, but with ocean on its own nodes. It would be nice to get this "fixed" (if that is the right word) so that this is not a requirement, but running ocean on its own nodes will be a production configuration in any case.
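A sketch of what an unstacked layout might look like with the usual CESM/CIME PE-layout variables (the ICE count is taken from the 21600-task runs above purely as an example; the OCN count is a placeholder):

```sh
# Keep ICE (and the other stacked components) on ranks 0-21599, and start
# OCN on its own block of ranks so the two never share nodes.
./xmlchange ROOTPE_ICE=0
./xmlchange NTASKS_ICE=21600
./xmlchange ROOTPE_OCN=21600     # first rank after the stacked components
./xmlchange NTASKS_OCN=8192      # placeholder count
```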
Yeah, I'm thinking a good solution would be to redirect to something like |
@worleyph I just assigned this to myself, to help me remember that you need a fix from me for this.
@douglasjacobsen , I'm concerned that /tmp may be too small - I'm not even sure where it exists for the compute nodes on the systems at ALCF, NERSC, and OLCF. I would need to check this out. I would hate for jobs to abort simply because /tmp is not cleaned out between system PMs. I just looked at one of the compute nodes:
and all of the contents seem to be dated as of the time I was allocated the node, so perhaps a sweep occurs with every new job. I still don't know where it physically resides. In any case, I'll give it a try on Titan. Someone else may want to do so on the other systems. I'll also look at performance compared to /dev/null (when ice and ocean are not overlapped). Thanks.
@worleyph : I did get A_WCYCL2000 ne120_oRRS15 to run last night on Edison, using both the Intel and GNU compilers. My tests were in debug mode and only ran a limited number of timesteps, but all components did initialize and run successfully. I'll try today in optimized mode, and work to get the necessary model configuration changes into the scripts. I think this means the mapping files are all OK -- with the exception of the runoff map, which has a known issue in magnitude.
PR #875 fixed the problem. Closing. |
This is premature, but it is the same type of error that we have seen before. I would like to confirm that this is not something already known and being worked on.
so on Titan. It is dying, without an error message. Tail of cpl.log:
ice.log is empty. Tail of cesm.log:
Here NID 10122 is where processes 0, 1, 2, and 3 are located. This failed twice, both times with a 21600x4 atmosphere (and all other components, though 21600x1 for mpas-cice and mpas-o). The two runs used
and
I'm trying again, but with 5400x4 this time, and with
Job is in the queue.
Any suggestions? If the next job fails (in the same way), I'll look at putting CICE on its own compute nodes to help isolate what is going on.
Adding @jayeshkrishna , @mt5555 , @golaz , @amametjanov to the conversation.