Derecho transition: Tests and test infrastructure #1995

Closed
4 tasks done
slevis-lmwg opened this issue May 4, 2023 · 18 comments · Fixed by #2269

Comments

@slevis-lmwg
Contributor

slevis-lmwg commented May 4, 2023

  • Change test list: move cheyenne tests to derecho, change PE layouts required (just to get working)
  • Add more tests for nvhpc (tried, but there are issues)
  • Replace intel tests and also add some intel-oneapi compiler tests (tried but there are issues)
  • Add derecho support in run_sys_tests
@billsacks
Member

I took the liberty of adding some items to your list; I hope you don't mind.

@slevis-lmwg slevis-lmwg added the investigation Needs to be verified and more investigation into what's going on. label Jul 18, 2023
@ekluzek
Collaborator

ekluzek commented Aug 1, 2023

Derecho (derecho.hpc.ucar.edu) will be available to everyone next week, but I am able to log in today. We talked about derecho in CSEG, so I'm updating the list above.

Some points:

  • cray compiler is an option, but it doesn't currently work
  • nvhpc is the only compiler you can use on GPUs
  • Nodes have 128 processors each, so our concurrent PE layouts might need some rethinking.
  • intel, intel-oneapi, and intel-classic are the intel compiler options. intel and intel-classic are almost the same, so I suggest we only test intel-oneapi and intel-classic (intel is kind of a hybrid between the two).

@ekluzek ekluzek added the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Aug 1, 2023
@ekluzek
Collaborator

ekluzek commented Aug 1, 2023

Note that intel-classic and intel will be deprecated, and intel-oneapi will eventually be the only option. intel-oneapi uses the LLVM-based compilers for both C and Fortran.

@ekluzek
Collaborator

ekluzek commented Aug 3, 2023

I ran my first test on derecho just to see what would happen. I ran:

SMS_D_Ld3.f10_f10_mg37.I2000Clm50BgcCru.derecho_intel-oneapi.clm-default

The default PE layout becomes a problem immediately with this error in cesm.log:

dec2243.hsn.de.hpc.ucar.edu 1:  decompInit_lnd(): Number of processes exceeds number of land grid cells
dec2243.hsn.de.hpc.ucar.edu 1:          256         253

The default PE layout uses 256 processors, which is 2 nodes, so I'll try with one node.
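
The log shows `decompInit_lnd()` aborting because the task count (256) exceeds the number of land grid cells (253). As a rough sketch of the arithmetic (not CTSM or CIME code; the function name and the whole-node policy are assumptions made for illustration), the largest whole-node task count that still fits can be computed like this:

```python
def max_whole_node_tasks(land_cells: int, cores_per_node: int = 128) -> int:
    """Largest multiple of cores_per_node that does not exceed land_cells.

    Falls back to land_cells itself when even one full node is too many.
    Hypothetical helper: the whole-node policy is an assumption.
    """
    if land_cells < 1 or cores_per_node < 1:
        raise ValueError("land_cells and cores_per_node must be positive")
    nodes = land_cells // cores_per_node
    if nodes == 0:
        return land_cells
    return nodes * cores_per_node


# Values from the log above: 253 land grid cells, 128 cores per Derecho node.
print(max_whole_node_tasks(253))  # 128
```

For the f10 grid this gives one node (128 tasks), which matches the layout tried next.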

@ekluzek
Collaborator

ekluzek commented Aug 3, 2023

Using one node (128 processors), it now dies with the following:

dec2433.hsn.de.hpc.ucar.edu 126: MOSART decomp info proc =       126 begr =    255151 endr =    257175 numr =      2025
dec2433.hsn.de.hpc.ucar.edu 127: MOSART decomp info proc =       127 begr =    257176 endr =    259200 numr =      2025
dec2433.hsn.de.hpc.ucar.edu 24: forrtl: error (65): floating invalid
dec2433.hsn.de.hpc.ucar.edu 24: Image              PC                Routine            Line        Source
dec2433.hsn.de.hpc.ucar.edu 24: libpthread-2.31.s  000014971230F8C0  Unknown               Unknown  Unknown
dec2433.hsn.de.hpc.ucar.edu 24: cesm.exe           00000000031F67BE  mosart_init              2587  RtmMod.F90
dec2433.hsn.de.hpc.ucar.edu 24: cesm.exe           0000000003196901  rtmini                   1298  RtmMod.F90
dec2433.hsn.de.hpc.ucar.edu 24: cesm.exe           00000000031464CD  initializerealize         498  rof_comp_nuopc.F90
dec2433.hsn.de.hpc.ucar.edu 24: libesmf.so         000014971ADB5B69  callVFuncPtr             2167  ESMCI_FTable.C
dec2433.hsn.de.hpc.ucar.edu 24: libesmf.so         000014971ADB4BA8  ESMCI_FTableCallE         824  ESMCI_FTable.C
dec2433.hsn.de.hpc.ucar.edu 24: libesmf.so         000014971B23D792  enter                    2321  ESMCI_VMKernel.C
dec2433.hsn.de.hpc.ucar.edu 24: libesmf.so         000014971B226E70  enter                    1216  ESMCI_VM.C
dec2433.hsn.de.hpc.ucar.edu 24: libesmf.so         000014971ADB5F4F  c_esmc_ftablecall         981  ESMCI_FTable.C
dec2433.hsn.de.hpc.ucar.edu 24: libesmf.so         000014971B88C3B8  esmf_compmod_mp_e        1223  ESMF_Comp.F90
dec2433.hsn.de.hpc.ucar.edu 24: libesmf.so         000014971C21EFE5  esmf_gridcompmod_        1412  ESMF_GridComp.F90
dec2433.hsn.de.hpc.ucar.edu 24: libesmf.so         000014971CCBDEC0  nuopc_driver_mp_l        2886  NUOPC_Driver.F90
dec2433.hsn.de.hpc.ucar.edu 24: libesmf.so         000014971CCA62BD  nuopc_driver_mp_i        1979  NUOPC_Driver.F90

@billsacks
Member

Thanks for starting to work on this, @ekluzek ! My understanding is that intel-oneapi is somewhat bleeding edge and that the default will probably be standard "intel" for now. So I think most of our testing should be on "intel"; I wasn't clear from Tuesday's CSEG meeting whether it's worth having testing for intel-oneapi in addition, but I think we should follow the lead of Jim & Chris on that.

@billsacks
Member

From today's ctsm-software discussion: tentative thought is to try to have the test list switched over by end of September. When we're ready to switch this over, we'll stop testing on cheyenne (not try to have aux_clm tests run on both cheyenne and derecho at once, because that's a pain).

@ekluzek
Collaborator

ekluzek commented Aug 3, 2023

This article confirms an end-of-year timeline for Cheyenne to be shut down for good:

https://arc.ucar.edu/articles/452

@ekluzek
Collaborator

ekluzek commented Aug 3, 2023

Running with the plain intel compiler option on one node PASSes, as we'd hope:

SMS_D_Ld3_P128x1.f10_f10_mg37.I2000Clm50BgcCru.derecho_intel.clm-default

@slevis-lmwg
Contributor Author

From today's ctsm software meeting:
Though much of glade will go away with the transition to derecho, campaign will stay.

  • /campaign is visible from derecho and from cheyenne and casper.
  • @ekluzek would like a people directory; people should then be responsible for moving their own personal stuff there.
  • Will suggests moving /glade/p/cgd/tss as is to campaign.
  • We’ll want to move some things elsewhere: baselines, forcing data, etc.
  • We will continue this discussion in a TSS meeting.

@wwieder
Contributor

wwieder commented Aug 10, 2023

@ekluzek can you create a /glade/campaign/cgd/tss/people directory that has the right permissions for people to add stuff there? Subsequently, we can ask people to migrate files and clean up the tss directory.

@ekluzek
Collaborator

ekluzek commented Aug 10, 2023

I've added the directory, and I've started asking individuals to move their stuff over. I've sent requests to most people; there are a few others to ask. Also, Jackie has a directory under there. What should happen to it? I don't know if Jackie still has access.

@ekluzek
Collaborator

ekluzek commented Sep 7, 2023

@fischer-ncar points out that f10 tests are failing on Derecho because we don't have this set up in our config_pes file:

Here's a list of clm prealpha tests that failed with too many tasks.  They're all f10_f10_mg37.

FAIL SMS_Lm1.f10_f10_mg37.I1850Clm50BgcCropCmip6waccm.derecho_gnu.clm-basic RUN time=18
dec2449.hsn.de.hpc.ucar.edu 128:  decompInit_lnd(): Number of processes exceeds number of land grid cells     256     253

FAIL DAE_C2_D_Lh12.f10_f10_mg37.I2000Clm50BgcCrop.derecho_intel.clm-DA_multidrv RUN time=42
dec0516.hsn.de.hpc.ucar.edu 11:  decompInit_lnd(): Number of processes exceeds number of land grid cells
dec0516.hsn.de.hpc.ucar.edu 11:      256     253

FAIL MULTINOAIS_Ly2.f10_f10_ais8gris4_mg37.I1850Clm50SpRsGag.derecho_intel.cism-change_params RUN time=15
dec0533.hsn.de.hpc.ucar.edu 4:  decompInit_lnd(): Number of processes exceeds number of land grid cells
dec0533.hsn.de.hpc.ucar.edu 4:      256     253

FAIL NCK_Ld1.f10_f10_mg37.I2000Clm50Sp.derecho_intel.clm-default RUN time=72
case2run
dec2146.hsn.de.hpc.ucar.edu 128:  decompInit_lnd(): Number of processes exceeds number of land grid cells
dec2146.hsn.de.hpc.ucar.edu 128:      256     253

FAIL SMS_Lm13.f10_f10_mg37.I1850Clm50SpG.derecho_intel RUN time=17
dec0370.hsn.de.hpc.ucar.edu 25:  decompInit_lnd(): Number of processes exceeds number of land grid cells
dec0370.hsn.de.hpc.ucar.edu 25:      256     253
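
To check that a batch of failures like this shares a resolution, the FAIL lines can be split into their dot-separated fields. The sketch below is illustrative only: `parse_fail_line` is a hypothetical helper, and it assumes the test-name layout seen in this thread (test type, grid alias, compset, `machine_compiler`, optional test mod).

```python
from collections import Counter


def parse_fail_line(line):
    """Return (grid_alias, machine, compiler) for a 'FAIL <test> ...' line, else None."""
    parts = line.split()
    if len(parts) < 2 or parts[0] != "FAIL":
        return None
    fields = parts[1].split(".")
    if len(fields) < 4:
        return None  # not a full test name
    machine, _, compiler = fields[3].partition("_")
    return fields[1], machine, compiler


# Three of the FAIL lines quoted above, plus one log line that should be skipped.
REPORT = """\
FAIL SMS_Lm1.f10_f10_mg37.I1850Clm50BgcCropCmip6waccm.derecho_gnu.clm-basic RUN time=18
dec2449.hsn.de.hpc.ucar.edu 128:  decompInit_lnd(): Number of processes exceeds number of land grid cells     256     253
FAIL MULTINOAIS_Ly2.f10_f10_ais8gris4_mg37.I1850Clm50SpRsGag.derecho_intel.cism-change_params RUN time=15
FAIL SMS_Lm13.f10_f10_mg37.I1850Clm50SpG.derecho_intel RUN time=17
"""

failures = [p for p in map(parse_fail_line, REPORT.splitlines()) if p is not None]
# Assumption: the leading token of the grid alias (here "f10") is the resolution of interest.
by_resolution = Counter(grid.split("_")[0] for grid, _machine, _compiler in failures)
print(by_resolution)
```

Grouping by the leading token rather than the full alias keeps the `f10_f10_ais8gris4_mg37` case together with the plain `f10_f10_mg37` ones.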

@ekluzek
Collaborator

ekluzek commented Sep 8, 2023

We are going to let these tests fail on Derecho for CESM until we can get to working on this.

@ekluzek
Collaborator

ekluzek commented Sep 21, 2023

@fischer-ncar points out another test that failed. This probably goes to show that we'll need to change our threading tests to PE counts that will work well on Derecho.

I have another ctsm prealpha test that failed on derecho due to the task count.

FAIL ERP_P72x2_D_Ld5.f19_g17_gris4.I1850Clm50BgcCropG.derecho_intel.clm-glcMEC_increase RUN time=1222

When I changed the test to use P64x2 it passed.
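
One way to see how a threaded layout maps onto nodes is to multiply tasks by threads and compare the result with the node size. This is only a sketch: `node_usage` is a hypothetical helper, it assumes the `_PNxM` token means N MPI tasks with M threads each, and the 36-core Cheyenne node size is supplied from outside this thread. It does not diagnose the failure itself, which is attributed above only to the task count.

```python
import math
import re

# Assumed convention: "_P72x2" in the test type means 72 MPI tasks x 2 threads each.
PE_PATTERN = re.compile(r"_P(\d+)x(\d+)")


def node_usage(test_name, cores_per_node):
    """Return cores requested, nodes needed, and idle cores for a _PNxM test name."""
    match = PE_PATTERN.search(test_name.split(".")[0])
    if match is None:
        raise ValueError(f"no _PNxM layout found in {test_name!r}")
    tasks, threads = int(match.group(1)), int(match.group(2))
    cores = tasks * threads
    nodes = math.ceil(cores / cores_per_node)
    return {"cores": cores, "nodes": nodes, "idle_cores": nodes * cores_per_node - cores}


failing = "ERP_P72x2_D_Ld5.f19_g17_gris4.I1850Clm50BgcCropG.derecho_intel.clm-glcMEC_increase"
passing = failing.replace("_P72x2", "_P64x2")

print(node_usage(failing, cores_per_node=128))  # Derecho: 144 cores, 2 nodes, 112 idle
print(node_usage(passing, cores_per_node=128))  # Derecho: 128 cores, 1 node, 0 idle
print(node_usage(failing, cores_per_node=36))   # Cheyenne (assumed 36 cores/node): 4 nodes, 0 idle
```

Under these assumptions P72x2 fills Cheyenne nodes exactly but leaves most of a second Derecho node idle, while P64x2 fills one Derecho node exactly.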

@samsrabin samsrabin added this to the Cheyenne shutdown milestone Oct 11, 2023
@samsrabin samsrabin changed the title Cheyenne to Derecho transition, TODOs Derecho transition: Tests and test infrastructure Oct 16, 2023
@samsrabin
Collaborator

@slevis-lmwg and @ekluzek: I've moved two of the items from this issue to standalone issues and edited the original post here to reflect that. Then, because there are a lot of posts here about testing, I left this as an omnibus issue for test-related work. Hope that's okay!

@ekluzek
Collaborator

ekluzek commented Oct 19, 2023

I think @samsrabin mentioned this and I questioned it, but he is right: you can't use CTSM out of the box on derecho. You need to update ccs_config to at least ccs_config_cesm0.0.72. We are about 5 tags before that one.
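
A quick way to check whether a given ccs_config tag meets that minimum is to compare the numeric parts rather than the raw strings. The helpers below are hypothetical and assume tags always follow the `ccs_config_cesmX.Y.Z` pattern; the older tag in the example is made up for illustration.

```python
import re


def ccs_config_version(tag):
    """Parse 'ccs_config_cesmX.Y.Z' into a tuple of ints, e.g. (0, 0, 72)."""
    match = re.fullmatch(r"ccs_config_cesm(\d+(?:\.\d+)*)", tag)
    if match is None:
        raise ValueError(f"unrecognized ccs_config tag: {tag!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(tag, minimum="ccs_config_cesm0.0.72"):
    """True when tag is at or after the minimum tag named in this thread."""
    return ccs_config_version(tag) >= ccs_config_version(minimum)


print(meets_minimum("ccs_config_cesm0.0.72"))   # True
print(meets_minimum("ccs_config_cesm0.0.67"))   # False (illustrative older tag)
print(meets_minimum("ccs_config_cesm0.0.100"))  # True; a plain string comparison would get this wrong
```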

@ekluzek
Collaborator

ekluzek commented Nov 11, 2023

I made the changes I saw for run_sys_tests on the CESM3_dev branch, but it wasn't working. With @billsacks's help I was able to diagnose it and get it working there, so it can readily be moved to main-dev.

@samsrabin samsrabin removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Mar 21, 2024