enable asyncio using pio #325

Merged: 21 commits, Jan 26, 2023

@jedwards4b (Collaborator) commented Dec 6, 2022

Description of changes

Allows I/O tasks to run independently of compute tasks in CESM.

Specific notes

(testing in progress)
Contributors other than yourself, if any:
Depends on share (ESCOMP/CESM_share#37) and cime (ESMCI/cime#4340).

CMEPS Issues Fixed (include github issue #):

Are changes expected to change answers? (specify if bfb, different at roundoff, more substantial)

Any User Interface Changes (namelist or namelist defaults changes)?

Testing performed

Testing performed if application target is CESM:

  • (recommended) CIME_DRIVER=nuopc scripts_regression_tests.py
    • machines:
    • details (e.g. failed tests):
  • (recommended) CESM testlist_drv.xml
    • machines and compilers:
    • details (e.g. failed tests):
  • (optional) CESM prealpha test
    • machines and compilers: cheyenne intel
    • details (e.g. failed tests): results consistent with cesm2_3_alpha10d
  • (other) please describe in detail
    • machines and compilers
    • details (e.g. failed tests):

Testing performed if application target is UFS-coupled:

  • (recommended) UFS-coupled testing
    • description:
    • details (e.g. failed tests):

Testing performed if application target is UFS-HAFS:

  • (recommended) UFS-HAFS testing
    • description:
    • details (e.g. failed tests):

Hashes used for testing:

  • CESM:
  • UFS-coupled, then umbrella repository to check out and associated hash:
    • repository to check out:
    • branch/hash:
  • UFS-HAFS, then umbrella repository to check out and associated hash:
    • repository to check out:
    • branch/hash:

@jedwards4b requested a review from billsacks December 6, 2022 22:58
@jedwards4b self-assigned this Dec 6, 2022
@billsacks (Member) left a comment

Thanks a lot for your work on this @jedwards4b ! This is a really exciting new feature!

I have a few comments below. Most of them are minor & easy. The main one that might be more involved is my long comment, particularly the second paragraph starting "Part of what I'm wondering about here...". Let me know if you'd like to talk about this.

@@ -53,6 +62,14 @@ subroutine SetServices(ensemble_driver, rc)
specRoutine=SetModelServices, rc=rc)
if (chkerr(rc,__LINE__,u_FILE_u)) return

! ModifyCplLists is a NUOPC specialization which happens after Advertize but before Realize
Member

This comment should be changed to say "PostChildrenAdvertise is a..."

Collaborator Author

done

Comment on lines 270 to 278
if(pio_asyncio_stride == 0 .or. modulo(n,pio_asyncio_rootpe+1) .ne. 0) then
   petList(petcnt) = currentpet
   petcnt = petcnt+1
   if (currentpet == localPet) comp_task=.true.
else
   asyncio_petlist(iopetcnt) = currentpet
   iopetcnt = iopetcnt + 1
   if (currentpet == localPet) asyncio_task=.true.
endif
Member

Does one of these blocks (I'm thinking the first one) apply when not using asyncio? If so, I think this would be more clear and robust if there was an explicit check on whether asyncio is being used, or at least a comment here explaining this.

Part of what I'm wondering about here is whether there is an assumption here - and maybe elsewhere - that, if you're not using asyncio, then pio_asyncio_ntasks, pio_asyncio_stride and pio_asyncio_rootpe are at their default values. I'm imagining a scenario where someone first enables asyncio and tweaks those settings, but then wants to try rerunning with asyncio disabled. Do they need to explicitly reset those three variables to their defaults in that situation? If so, that seems error-prone. Can this (and maybe some other code) be written to examine pio_async_interface and ignore those three variables if that is false?

Collaborator Author

I think that you may be right here. I'll create a few scenarios to test this and put in a fix if needed.
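
For illustration only (not part of this PR), here is a minimal Python sketch of how the quoted condition on lines 270 to 278 splits PETs between the compute list and the asyncio list. It assumes that `n` and `currentpet` coincide and that the loop is 0-based, neither of which is visible in the quoted hunk; `partition_pets` is a hypothetical name.

```python
def partition_pets(npets, stride, rootpe):
    """Split PET indices the way the quoted Fortran branch does.

    Assumptions (not confirmed by the quoted hunk): the loop index n
    equals the PET number and runs from 0 to npets - 1.
    """
    comp_pets = []  # corresponds to petList in the Fortran
    io_pets = []    # corresponds to asyncio_petlist in the Fortran
    for n in range(npets):
        if stride == 0 or n % (rootpe + 1) != 0:
            comp_pets.append(n)
        else:
            io_pets.append(n)
    return comp_pets, io_pets


if __name__ == "__main__":
    # With stride == 0 only the first block runs, whatever rootpe is.
    print(partition_pets(4, 0, 1))
    # With a leftover non-zero stride the else block can run as well.
    print(partition_pets(4, 2, 1))
```

Under these assumptions, the first block is the only one taken when `pio_asyncio_stride` is 0, and a non-zero stride left over from an earlier asyncio run is enough to send some PETs to the asyncio list. That is the scenario raised in the review comment above.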

Collaborator Author

It's difficult to get to the pio_async_interface flags at this point in the Fortran, so I am thinking about implementing something in buildnml instead. Would that be acceptable?

Member

Sure, that sounds like a good plan - thanks. Then can you just add a comment here saying which of these blocks (if either) applies to the case without asyncio?
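
As a rough sketch of the buildnml-side idea discussed in this thread (not the code that was merged), the three asyncio settings could be reset whenever `pio_async_interface` is false, so that stale values from an earlier run are ignored. The function name, the dictionary layout, and the default values below are all assumptions made for illustration; the real defaults live in the namelist definitions.

```python
# Hypothetical defaults, chosen only for this example.
ASSUMED_DEFAULTS = {
    "pio_asyncio_ntasks": 0,
    "pio_asyncio_stride": 0,
    "pio_asyncio_rootpe": 1,
}


def normalize_asyncio_settings(settings, defaults=ASSUMED_DEFAULTS):
    """Return a copy of settings with asyncio values reset when async is off.

    settings is a plain dict of namelist-style values. The input dict is
    left unmodified.
    """
    result = dict(settings)
    if not result.get("pio_async_interface", False):
        # Async I/O is disabled: ignore whatever the user left behind.
        result.update(defaults)
    return result


if __name__ == "__main__":
    stale = {
        "pio_async_interface": False,
        "pio_asyncio_ntasks": 8,
        "pio_asyncio_stride": 4,
        "pio_asyncio_rootpe": 3,
    }
    print(normalize_asyncio_settings(stale))
```

Doing the reset before the namelist is written means the Fortran side never sees a non-default stride unless asyncio is actually enabled.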

Comment on lines 934 to 935
! call driver_pio_init(driver, rc=rc)
! if (chkerr(rc,__LINE__,u_FILE_u)) return
Member

Would it make sense to go ahead and remove these commented-out lines, along with the two comment lines above?

Collaborator Author

done

Comment on lines 1182 to 1183
! call driver_pio_component_init(driver, size(comps), rc)
! if (chkerr(rc,__LINE__,u_FILE_u)) return
Member

Would it make sense to go ahead and remove these commented-out lines, along with the two comment lines above?

Collaborator Author

done

Comment on lines 158 to 160
call NUOPC_CompAttributeGet(ensemble_driver, name="glc_avg_period", value=glc_avg_period, rc=rc)
if (ChkErr(rc,__LINE__,u_FILE_u)) return
read(cvalue,*) glc_avg_period
Member

It looks like you don't use glc_avg_period, so you could remove these lines. (I see you just moved this from below, so this is a pre-existing thing.)

Collaborator Author

done


call NUOPC_CompAttributeGet(gcomp(i), name="pio_typename", value=cval, rc=rc)
call NUOPC_CompAttributeGet(gcomp(i), name="pio_async_interface", value=cval, rc=rc)
Member

At a glance, it looks like this block of code from here to line 354 duplicates the above block of code.

Collaborator Author

thanks for spotting that.

@billsacks (Member) left a comment

Okay, this looks great now. I'm happy with how you handled the asyncio settings in buildnml. Thanks!

private InitializeIPDv03p3 ! realize connected Fields with transfer action "provide"
private InitializeIPDv03p4 ! optionally modify the decomp/distr of transferred Grid/Mesh
private InitializeIPDv03p5 ! realize all Fields with transfer action "accept"
private AdvertiseFields ! advertise fields
Collaborator

Thanks for these.

@DeniseWorthen (Collaborator) left a comment

All UWM tests pass.

@jedwards4b
Collaborator Author

Retested prealpha against alpha12b

 ./cs.status.20230125_114302_bxq3t5 -f | grep -v NLCOMP | grep -v BFAIL
20230125_114302_bxq3t5: 87 tests
    FAIL DAE_N2_D_Lh12_Vmct.f10_f10_mg37.I2000Clm50BgcCrop.cheyenne_intel.clm-DA_multidrv TPUTCOMP Error: TPUTCOMP: Computation time increase > 25% from baseline
    FAIL ERP_D_Ln9_Vnuopc.C48_C48_mg17.QPC6.cheyenne_intel.cam-outfrq9s MODEL_BUILD time=245
    FAIL ERP_Ld3_Vnuopc.f09_f09_mg17.FCfireHIST.cheyenne_intel.cam-outfrq1d BASELINE cesm2_3_alpha12b: DIFF
    FAIL IRT_C3_Ld7.f19_g17.BHIST.cheyenne_intel.allactive-defaultio RUN time=7222
    PEND IRT_C3_Ld7.f19_g17.BHIST.cheyenne_intel.allactive-defaultio COMPARE_base_restart
    FAIL MCC.f19_g17.B1850.cheyenne_intel.allactive-defaultiomi RUN time=7231
    PEND MCC.f19_g17.B1850.cheyenne_intel.allactive-defaultiomi COMPARE_base_single_instance
    FAIL MCC_Ld5.f19_g17.B1850G.cheyenne_intel.allactive-cism-test_coupling RUN time=5427
    PEND MCC_Ld5.f19_g17.B1850G.cheyenne_intel.allactive-cism-test_coupling COMPARE_base_single_instance
    FAIL SMS_D_Ld1_PS.f09_g17.I1850Clm50BgcSpinup.cheyenne_intel.clm-cplhist RUN time=46
    FAIL SMS_D_Ln9_Vnuopc_P720x1.ne0CONUSne30x8_ne0CONUSne30x8_mt12.FCnudged.cheyenne_intel.cam-outfrq9s RUN time=1216
    FAIL SMS_Ld1_P144_D.T62_g17.C.cheyenne_intel.pop-144blocks_320x384_spacecurve RUN time=43244
    FAIL SMS_Ld2_P80_D.T62_g37.C1850ECO.cheyenne_intel.pop-ecosys_81blocks_100x116_spacecurve RUN time=43257

@fischer-ncar these all look like failures in alpha12b - can I go ahead and merge?

@fischer-ncar
Contributor

Yes, those are expected failures. You can go ahead and merge.

@jedwards4b merged commit efa1e47 into ESCOMP:main Jan 26, 2023
@jedwards4b deleted the pio_asyncio_in_cmeps branch January 26, 2023 16:22
5 participants