PE layout of 384x3 on cheyenne fails with updated ESMF library #673
Comments
Do we understand why the 384x3 PE layout is a problem?
I'll let @jedwards4b answer that, but my understanding is that it's a change in how ESMF handles tasks and threads.
I was seeing this with our MCT WACCM tests. At a meeting today, the scientists agreed that we can drop these tests, which we will do in our next CAM tag. Does this issue extend to tests other than these?
Francis made a good point that I did not follow up on: 384x3 fits properly on 32 cheyenne nodes, so my reasoning for this change doesn't make sense. I'll look into it further.
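For reference, the node arithmetic behind that point can be checked with a short sketch. It assumes 36 cores per cheyenne node, which is what makes 384x3 land on exactly 32 nodes; the function name `nodes_needed` is made up for illustration and is not part of CIME or ESMF.

```python
import math

# Assumption: each cheyenne compute node exposes 36 cores.
CORES_PER_NODE = 36


def nodes_needed(tasks, threads, cores_per_node=CORES_PER_NODE):
    """Return (total cores, nodes required, whether the nodes are filled exactly)."""
    cores = tasks * threads
    nodes = math.ceil(cores / cores_per_node)
    fills_exactly = cores % cores_per_node == 0
    return cores, nodes, fills_exactly


# The two layouts discussed in this issue, written as (tasks, threads).
for tasks, threads in [(384, 3), (360, 3)]:
    cores, nodes, exact = nodes_needed(tasks, threads)
    print(f"{tasks}x{threads}: {cores} cores, {nodes} nodes, exact fit: {exact}")
```

Under that assumption both layouts fill whole nodes (32 and 30 respectively), so node packing alone would not explain the failure.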
These are the prealpha tests that failed with 384x3 but pass with 360x3: SMS_D_Ln9_Vnuopc.f09_f09_mg17.FCHIST.cheyenne_intel.cam-outfrq9s_ocnemis
@fischer-ncar - What error messages are you getting from these runs, and have you verified that the failures are repeatable? I ran the CAM regression tests last night for an upcoming CAM tag and had several odd failures that I had not seen in previous testing. I think cheyenne was giving me machine hiccups, and I'm running them again. Also, the test ERP_Ld3_Vnuopc.f09_f09_mg17.FWHIST.cheyenne_intel.cam-reduced_hist1d1 ran fine for me last night, and it used the 384x3 PE layout. I personally have not had any failures using the 384x3 layout in anything other than the MCT tests.
Also, I should mention that cam6_3_079, which is for PR #666, will have the problematic MCT WACCM tests removed. I'm hopeful that tag will be coming in the next day or two (depending on how cheyenne treats my tests).
@cacraigucar The failures are repeatable, and the error message I'm getting is: MPT ERROR: Rank 297(g:297) received signal SIGFPE(8).
@fvitt @peverwhee @nusbaume Do we want to get this issue fixed in the tag that's going into alpha10c? Or do you want to wait for @jedwards4b to look into this issue?
It would be good to understand why the 384x3 PE layout is a problem. I vote for waiting for Jim to look into it.
I agree with Francis, it would be good to get to the bottom of this, which probably means waiting for Jim.
Running test SMS_D_Ln9_Vnuopc.f09_f09_mg17.FCHIST.cheyenne_intel.cam-outfrq9s_ocnemis I set a breakpoint at
Fixed by ESCOMP/CDEPS#194
@fischer-ncar do you agree that this ticket can be closed now?
Bit late, but yes, this ticket can be closed.
Issue Type
Infrastructure Update
Issue Description
Change the PE layout for compset="_(CAM50|CAM60)%(CC|CV|CF|WC)" from 384x3 to 360x3 on cheyenne.
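To illustrate which compsets this entry would cover, here is a minimal sketch that assumes the compset attribute is applied as a regular-expression search against the compset long name. The helper `layout_applies` and the long names below are hypothetical and only show the matching behavior.

```python
import re

# Pattern copied from the issue description above.
PES_COMPSET_PATTERN = r"_(CAM50|CAM60)%(CC|CV|CF|WC)"


def layout_applies(compset_longname, pattern=PES_COMPSET_PATTERN):
    """Return True if the pattern is found anywhere in the compset long name."""
    return re.search(pattern, compset_longname) is not None


# Hypothetical long names, for illustration only.
examples = [
    "HIST_CAM60%CCTS_CLM50%SP",  # CAM60 with an option starting with CC
    "HIST_CAM60%WCTS_CLM50%SP",  # CAM60 with an option starting with WC
    "2000_CAM60_CLM50%SP",       # CAM60 with no % option
]
for name in examples:
    print(name, layout_applies(name))
```

Only CAM50 or CAM60 compsets whose option string begins with CC, CV, CF, or WC match under this reading; a plain CAM60 compset is left on its existing layout.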
Will this change answers?
No
Will you be implementing this yourself?
Yes