Threaded tests are slow on Derecho #2289

Open
ekluzek opened this issue Dec 12, 2023 · 7 comments
ekluzek commented Dec 12, 2023

Brief summary of bug

Longer threaded tests seem to be hanging on Derecho with release-cesm2.2.04

ERP_D_Ld10_P64x2.f10_f10_musgs.IHistClm50BgcCrop.derecho_intel.clm-ciso_decStart
ERP_Ly3_P64x2.f10_f10_musgs.IHistClm50BgcCrop.derecho_intel.clm-cropMonthOutput
ERP_P64x2_Lm25.f10_f10_musgs.I2000Clm50BgcCrop.derecho_intel.clm-monthly
ERP_P64x2_Lm36.f10_f10_musgs.I2000Clm50BgcCrop.derecho_intel.clm-clm50cropIrrigMonth_interp
ERP_P64x2_Lm7.f10_f10_musgs.I2000Clm50BgcCrop.derecho_intel.clm-irrig_alternate_monthly
ERP_P64x2_Ly3.f10_f10_musgs.I2000Clm50BgcCrop.derecho_intel.clm-irrig_o3_reduceOutput
ERS_Ly3_P64x2.f10_f10_musgs.IHistClm50BgcCropG.derecho_intel.clm-cropMonthOutput

General bug information

CTSM version you are using: release-cesm2.2.03-12-g1b4fa3602

Does this bug cause significantly incorrect results in the model's science? No?

Configurations affected: threaded tests that run for more than about a year and a half of simulation time

Details of bug

These tests were originally run on Cheyenne with a PE layout of 36x2, which worked in release-cesm2.2.02. On Derecho they seem to hang once the run gets longer, and they run out of the 1:40 wallclock limit.

Important output or errors that show the problem

cesm.log:

dec0275.hsn.de.hpc.ucar.edu 63: forrtl: error (78): process killed (SIGTERM)
dec0275.hsn.de.hpc.ucar.edu 63: Image              PC                Routine            Line        Source
dec0275.hsn.de.hpc.ucar.edu 63: libpthread-2.31.s  0000145589C668C0  Unknown               Unknown  Unknown
dec0275.hsn.de.hpc.ucar.edu 63: libmpi_intel.so.1  00001455890100C5  Unknown               Unknown  Unknown
dec0275.hsn.de.hpc.ucar.edu 63: libmpi_intel.so.1  00001455879B60C9  Unknown               Unknown  Unknown
dec0275.hsn.de.hpc.ucar.edu 63: libmpi_intel.so.1  00001455879EE0D6  Unknown               Unknown  Unknown
dec0275.hsn.de.hpc.ucar.edu 63: libmpi_intel.so.1  00001455879EF31F  MPI_Wait              Unknown  Unknown
dec0275.hsn.de.hpc.ucar.edu 63: libmpifort_intel.  00001455897EA3BC  mpi_wait              Unknown  Unknown
dec0275.hsn.de.hpc.ucar.edu 63: cesm.exe           000000000131B9C8  m_spmdutils_mp_m_         877  m_SPMDutils.F90
dec0275.hsn.de.hpc.ucar.edu 63: cesm.exe           000000000130F1E2  m_rearranger_mp_r        1132  m_Rearranger.F90
dec0275.hsn.de.hpc.ucar.edu 63: cesm.exe           0000000001301CA1  m_matattrvectmul_         514  m_MatAttrVectMul.F90
dec0275.hsn.de.hpc.ucar.edu 63: cesm.exe           0000000000CD46A1  mosart_physics_mo         145  MOSART_physics_mod.F90
dec0275.hsn.de.hpc.ucar.edu 63: cesm.exe           0000000000CA4D53  rtmmod_mp_rtmrun_        1830  RtmMod.F90
dec0275.hsn.de.hpc.ucar.edu 63: cesm.exe           0000000000C883AC  rof_comp_mct_mp_r         299  rof_comp_mct.F90
dec0275.hsn.de.hpc.ucar.edu 63: cesm.exe           0000000000436B2E  component_mod_mp_         737  component_mod.F90
dec0275.hsn.de.hpc.ucar.edu 63: cesm.exe           000000000041BFB0  cime_comp_mod_mp_        2638  cime_comp_mod.F90

lnd.log:

 nstep =      23926   TS =   274.112624074611290
 nstep =      23927   TS =   274.085864720367340
 nstep =      23928   TS =   274.022571240220486
 nstep =      23929   TS =   273.960032305360414
 nstep =      23930   TS =   274.039398595550153
 nstep =      23931   TS =   273.989408017349092
(shr_dmodel_readstrm) file ub: /glade/campaign/cesm/cesmdata/inputdata/atm/datm7/NASA_LIS/clmforc.Li_2012_climo1995-2011.T62.lnfm_Total_c140423.nc    1070
 nstep =      23932   TS =   273.916223972565433
 nstep =      23933   TS =   273.782815379037402
 nstep =      23934   TS =   273.618251977645627
 nstep =      23935   TS =   273.452100094584694
ekluzek added the "bug: something is working incorrectly" and "next: this should get some attention in the next week or two. Normally each Thursday SE meeting." labels on Dec 12, 2023
ekluzek self-assigned this on Dec 12, 2023
ekluzek added a commit to ekluzek/CTSM that referenced this issue Dec 12, 2023
ekluzek commented Dec 13, 2023

At @briandobbins' suggestion I ran on Cheyenne (with release-cesm2.2.02, because Cheyenne isn't in the externals for release-cesm2.2.03 or .04), and it completed in a reasonable amount of time.

PASS ERP_Ly3_P36x2.f10_f10_musgs.IHistClm50BgcCrop.cheyenne_intel.clm-cropMonthOutput RUN time=3481

I did get a 7-month test to pass:
PASS ERP_P128x2_Lm7.f10_f10_musgs.I2000Clm50BgcCrop.derecho_intel.clm-irrig_alternate_monthly RUN time=2951

This suggests that giving the 3-year tests more time (around 4.5 hours) might allow them to finish.
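For reference, a rough sketch of how that could be done (assuming the usual CIME knobs; the --walltime option to create_test and the JOB_WALLCLOCK_TIME xml variable are what I have in mind, but double-check against the CIME version in this tag):

# Hedged sketch: request a longer wallclock when creating the 3-year test
./create_test ERP_Ly3_P64x2.f10_f10_musgs.IHistClm50BgcCrop.derecho_intel.clm-cropMonthOutput --walltime 04:30:00

# Or, from an already-created case directory, bump the batch job's wallclock and resubmit
./xmlchange JOB_WALLCLOCK_TIME=04:30:00
./case.submit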

Note also that in the latest CTSM the threaded tests perform poorly:
ctsm_baselines/ctsm5.1.dev158> grep RUN Px2f10/TestStatus | grep Ly
ERP_Ly3_P64x2.f10_f10_mg37.IHistClm50BgcCrop.derecho_intel.clm-cropMonthOutput/TestStatus:PASS ERP_Ly3_P64x2.f10_f10_mg37.IHistClm50BgcCrop.derecho_intel.clm-cropMonthOutput RUN time=12943
ERP_P64x2_Ly3.f10_f10_mg37.I2000Clm50BgcCrop.derecho_intel.clm-irrig_o3falk_reduceOutput/TestStatus:PASS ERP_P64x2_Ly3.f10_f10_mg37.I2000Clm50BgcCrop.derecho_intel.clm-irrig_o3falk_reduceOutput RUN time=12588
ERS_Ly3_P64x2.f10_f10_mg37.IHistClm50BgcCropG.derecho_intel.clm-cropMonthOutput/TestStatus:PASS ERS_Ly3_P64x2.f10_f10_mg37.IHistClm50BgcCropG.derecho_intel.clm-cropMonthOutput RUN time=16504

While MPI-only tests are much faster:
ctsm_baselines/ctsm5.1.dev158> grep RUN Px1f10/TestStatus | grep Ly
ERS_Ly5_P128x1.f10_f10_mg37.IHistClm45BgcCrop.derecho_intel.clm-cropMonthOutput/TestStatus:PASS ERS_Ly5_P128x1.f10_f10_mg37.IHistClm45BgcCrop.derecho_intel.clm-cropMonthOutput RUN time=1782
ERS_Ly5_P128x1.f10_f10_mg37.IHistClm51BgcCrop.derecho_intel.clm-cropMonthOutput/TestStatus:PASS ERS_Ly5_P128x1.f10_f10_mg37.IHistClm51BgcCrop.derecho_intel.clm-cropMonthOutput RUN time=2651
LGRAIN2_Ly1_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_gnu.clm-ciso--clm-cropMonthOutput/TestStatus:PASS LGRAIN2_Ly1_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_gnu.clm-ciso--clm-cropMonthOutput RUN time=826
LGRAIN2_Ly2_P128x1.f10_f10_mg37.I1850Clm45BgcCrop.derecho_gnu.clm-ciso--clm-cropMonthOutput/TestStatus:PASS LGRAIN2_Ly2_P128x1.f10_f10_mg37.I1850Clm45BgcCrop.derecho_gnu.clm-ciso--clm-cropMonthOutput RUN time=1190
LREPRSTRUCT_Ly1_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_gnu.clm-ciso--clm-cropMonthOutput/TestStatus:PASS LREPRSTRUCT_Ly1_P128x1.f10_f10_mg37.I1850Clm50BgcCrop.derecho_gnu.clm-ciso--clm-cropMonthOutput RUN time=807
LREPRSTRUCT_Ly2_P128x1.f10_f10_mg37.I1850Clm45BgcCrop.derecho_gnu.clm-ciso--clm-cropMonthOutput/TestStatus:PASS LREPRSTRUCT_Ly2_P128x1.f10_f10_mg37.I1850Clm45BgcCrop.derecho_gnu.clm-ciso--clm-cropMonthOutput RUN time=1175

ekluzek changed the title from "Longer threaded tests are hanging on Derecho with cesm2.2.04" to "Threaded tests are slow on Derecho" on Dec 14, 2023
ekluzek commented Dec 14, 2023

This is a larger issue; CAM is seeing it as well:

ESCOMP/CAM#941 (comment)

It appears to be an issue for all versions. I had thought the tests were hanging, but it looks like it's just abysmal performance.

ekluzek commented Dec 14, 2023

Jian Sun had this to say in an exchange on cgd-cseg:

Hi Cheryl,

Thank you for sharing the CSEG meeting notes.

Thanks to Francis' suggestion, I am able to restore the threading performance of CAM by manually revising the MPI run command as below:
export OMP_PLACES=threads ; export OMP_PROC_BIND=close ; mpiexec --cpu-bind depth -n 64 -ppn 64 -d 2 ./cesm.exe
My current finding is that the poor threading performance of CAM on Derecho is due to the absence of those arguments. Rory from CISL is working on a wrapper script that will replace the long MPI command above with a single command (e.g., "mpibind"), and it should work for both MPI-exclusive and hybrid MPI/OpenMP jobs. I will test it once the script is in good shape and maybe add it to CAM / CIME later.

Hope this clarifies and let me know if you have any questions.

Thanks,
Jian
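
For reference, here is an annotated reading of that launch line (the flag descriptions below are my interpretation of the Cray PALS mpiexec options and the standard OpenMP environment variables; the counts are per-job, 64 ranks x 2 threads in this example):

# Pin each OpenMP thread to a hardware thread, and keep threads close to their parent MPI rank
export OMP_PLACES=threads
export OMP_PROC_BIND=close
# -n 64: total MPI ranks; -ppn 64: ranks per node; -d 2: reserve 2 CPUs per rank (one per OpenMP thread)
# --cpu-bind depth: bind each rank to a block of CPUs of that depth so its threads don't migrate
mpiexec --cpu-bind depth -n 64 -ppn 64 -d 2 ./cesm.exe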

ekluzek commented Dec 16, 2023

Threaded tests failing to run in a reasonable time with the release-clm5.0 branch:

ERP_P180x2_D.f19_g17.I2000Clm50SpRtmFl.derecho_intel.clm-default
ERP_P180x2_D_Ld5.f19_g17.I2000Clm50Sp.derecho_intel.clm-default
ERP_P180x2_D_Ld5.f19_g17_gl4.I1850Clm50BgcCropG.derecho_intel.clm-default
ERP_P180x2_D_Ld5.f19_g17_gl4.I1850Clm50BgcCropG.derecho_intel.clm-default

ekluzek commented Dec 28, 2023

Threading on Cheyenne seemed to be much more comparable to MPI-only:

For tests in ctsm5.1.dev157

1-year MPI-only
LGRAIN2_Ly1_P72x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_gnu.clm-ciso--clm-cropMonthOutput/TestStatus:PASS LGRAIN2_Ly1_P72x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_gnu.clm-ciso--clm-cropMonthOutput RUN time=1438
LREPRSTRUCT_Ly1_P72x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_gnu.clm-ciso--clm-cropMonthOutput/TestStatus:PASS LREPRSTRUCT_Ly1_P72x1.f10_f10_mg37.I1850Clm50BgcCrop.cheyenne_gnu.clm-ciso--clm-cropMonthOutput RUN time=1443

13-month threaded
ERP_P36x2_Lm13.f10_f10_mg37.IHistClm51Bgc.cheyenne_gnu.clm-monthly/TestStatus:PASS ERP_P36x2_Lm13.f10_f10_mg37.IHistClm51Bgc.cheyenne_gnu.clm-monthly RUN time=1447
ERP_P36x2_Lm13.f10_f10_mg37.IHistClm51Bgc.cheyenne_intel.clm-monthly/TestStatus:PASS ERP_P36x2_Lm13.f10_f10_mg37.IHistClm51Bgc.cheyenne_intel.clm-monthly RUN time=1084

ekluzek removed the "next: this should get some attention in the next week or two. Normally each Thursday SE meeting." label on Jan 4, 2024
ekluzek commented Jan 4, 2024

@briandobbins will look into this and get back to us.

briandobbins commented
The process placement fix does indeed solve this problem; runtimes (as of the first 50 steps) are ~10x faster.

Ultimately, the fix will simply be a change to how we launch MPI jobs, which lives in ccs_config_cesm, but the new script that does this properly for threaded jobs is still a work in progress at CISL. I'll update this issue once it's set, and will create a PR that updates CIME to a newer tag that includes the updated ccs_config_cesm version.
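
In the meantime, a quick way to see what a given case will actually do (a sketch; I'm assuming the preview_run tool is available in this CIME version):

# From the case directory: print the environment and the MPI launch command CIME will use
./preview_run
# Check for OMP_PLACES / OMP_PROC_BIND in the environment section and for --cpu-bind / -d on the mpiexec line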
