PE layout of 384x3 on cheyenne fails with updated ESMF library #673

Closed
fischer-ncar opened this issue Oct 11, 2022 · 16 comments
Labels
BFB (bit for bit tag), next tag (This issue is ready to be fixed in the next CAM tag)


@fischer-ncar
Collaborator

Issue Type

Infrastructure Update

Issue Description

The PE layout for compset="_(CAM50|CAM60)%(CC|CV|CF|WC)" will change from 384x3 to 360x3 on cheyenne.
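
For readers unfamiliar with the compset pattern above, here is a minimal sketch of how such a regular expression selects compsets. It assumes the pattern is applied as an unanchored search against the compset long name; the example long names are abbreviated, hypothetical strings for illustration only, and `uses_layout` is a made-up helper, not part of CIME.

```python
import re

# Pattern quoted in the issue description.
PATTERN = re.compile(r"_(CAM50|CAM60)%(CC|CV|CF|WC)")


def uses_layout(compset_longname: str) -> bool:
    """Return True if the long name matches the pattern (unanchored search, an assumption)."""
    return PATTERN.search(compset_longname) is not None


# Hypothetical, abbreviated long names used only to exercise the pattern.
examples = [
    "HIST_CAM60%WCTS_CLM50%SP",  # CAM60 with a WC* chemistry option: matches
    "HIST_CAM60%CCTS_CLM50%SP",  # CAM60 with a CC* chemistry option: matches
    "2000_CAM60_CLM50%SP",       # no %CC/%CV/%CF/%WC modifier: no match
]

for name in examples:
    print(name, uses_layout(name))
```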

Will this change answers?

No

Will you be implementing this yourself?

Yes

@fischer-ncar added the "next tag" (This issue is ready to be fixed in the next CAM tag) and "BFB" (bit for bit tag) labels Oct 11, 2022
@fischer-ncar self-assigned this Oct 11, 2022
@fvitt

fvitt commented Oct 11, 2022

Do we understand why 384x3 PE layout is a problem?

@fischer-ncar
Collaborator Author

I'll let @jedwards4b answer that, but my understanding is that it's a change in how ESMF handles tasks and threads.

@cacraigucar
Collaborator

I was seeing this with our MCT WACCM tests. At a meeting today, the scientists agreed that we can drop these tests, which we will do in our next CAM tag. Does this issue extend to tests other than these?

@jedwards4b

Francis had a good point that I did not follow up on. That is that 384x3 fits properly on 32 cheyenne nodes and so my reasoning for this change doesn't make sense. I'll look into it further.
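
The node arithmetic behind that point can be checked with a short sketch. It assumes 36 cores per cheyenne node, a figure not stated in this thread, and reads a layout "AxB" as A MPI tasks with B threads each; `nodes_needed` is a hypothetical helper.

```python
import math

# Assumption: 36 cores per cheyenne node (not stated in this thread).
CORES_PER_NODE = 36


def nodes_needed(tasks: int, threads: int, cores_per_node: int = CORES_PER_NODE):
    """Return (node count, True if the layout fills every node exactly)."""
    cores = tasks * threads
    return math.ceil(cores / cores_per_node), cores % cores_per_node == 0


for tasks in (384, 360):
    nodes, exact = nodes_needed(tasks, 3)
    print(f"{tasks}x3 -> {tasks * 3} cores, {nodes} nodes, fills nodes exactly: {exact}")
```

Under that assumption both layouts fill whole nodes (32 nodes for 384x3, 30 nodes for 360x3), so node packing alone does not explain the failure.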

@fischer-ncar
Collaborator Author

These are the prealpha tests that failed with 384x3 but pass with 360x3:

SMS_D_Ln9_Vnuopc.f09_f09_mg17.FCHIST.cheyenne_intel.cam-outfrq9s_ocnemis
SMS_D_Ln9_Vnuopc.f09_f09_mg17.FCts2SD.cheyenne_intel.cam-outfrq9s R
SMS_D_Ln9_Vnuopc.f09_f09_mg17.FWHIST.cheyenne_intel.cam-reduced_hist3s
SMS_D_Ln9_Vnuopc.f09_f09_mg17.FWma2000climo.cheyenne_intel.cam-outfrq9s
SMS_D_Ln9_Vnuopc.f09_f09_mg17.FWsc1850.cheyenne_intel.cam-outfrq9s
SMS_D_Ln9_Vnuopc.f09_f09_mg17.FWscHIST.cheyenne_intel.cam-outfrq9s

@cacraigucar
Collaborator

@fischer-ncar - What are the error messages that you are getting from these runs, and have you verified that they are repeatable failures? I ran the CAM regression tests last night for an upcoming CAM tag and had several weird failures which I had not seen in previous testing. I think cheyenne was giving me machine hiccups, so I'm running them again. Also, the test ERP_Ld3_Vnuopc.f09_f09_mg17.FWHIST.cheyenne_intel.cam-reduced_hist1d1 ran fine for me last night, and it used the 384x3 PE layout. I personally have not had any failures using the 384x3 layout in anything other than the MCT tests.

@cacraigucar
Collaborator

Also, I should mention that cam6_3_079, which is for PR #666, will have the problematic MCT WACCM tests removed. I'm hopeful that tag will be coming in the next day or two (depending on how cheyenne treats my tests).

@fischer-ncar
Collaborator Author

@cacraigucar The failures are repeatable, and the error message I'm getting is:

MPT ERROR: Rank 297(g:297) received signal SIGFPE(8).

@peverwhee moved this to To Do in CAM Development Oct 18, 2022
@peverwhee added this to the CESM2.3 milestone Oct 18, 2022
@fischer-ncar
Collaborator Author

@fvitt @peverwhee @nusbaume Do we want to get this issue fixed in the tag that's going into alpha10c? Or do you want to wait for @jedwards4b to look into this issue?

@fvitt

fvitt commented Oct 25, 2022

It would be good to understand why the 384x3 PE layout is a problem. I vote for waiting for Jim to look into it.

@nusbaume
Collaborator

I agree with Francis; it would be good to get to the bottom of this, which probably means waiting for Jim.

@jedwards4b

Running test SMS_D_Ln9_Vnuopc.f09_f09_mg17.FCHIST.cheyenne_intel.cam-outfrq9s_ocnemis, I set a breakpoint at dshr_mod.F90 line 1417, which is a call to ESMF_FieldRegridStore. When I try to step into that function, many of the tasks appear to hang.

@jedwards4b

Fixed by ESCOMP/CDEPS#194

@jedwards4b

@fischer-ncar Do you agree that this ticket can be closed now?

@cacraigucar
Collaborator

We have it set so that this issue will be closed when we update the externals in CAM (PR #700), as we believe that the CDEPS update will fix the problem, correct? PR #700 is in final testing now.

Repository owner moved this from To Do to Done in CAM Development Nov 28, 2022
@fischer-ncar
Collaborator Author

Bit late, but yes, this ticket can be closed.
