Threaded tests are slow on Derecho #2289
On @briandobbins's suggestion I ran on Cheyenne (using release-cesm2.2.02, because Cheyenne isn't in the externals for release-cesm2.2.03 or "04"), and the test completed in a reasonable amount of time: PASS ERP_Ly3_P36x2.f10_f10_musgs.IHistClm50BgcCrop.cheyenne_intel.clm-cropMonthOutput RUN time=3481. I also got a 7-month test to pass, which suggests that giving the 3-year tests more time (4.5 hours) might work. Note also that in the latest CTSM, performance for the threaded tests is poor by comparison, while MPI-only tests are much faster.
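If raising the wallclock limit is the workaround, a re-run might look like the sketch below. The test name is taken from this issue; `--wallclock` is a standard CIME `create_test` option, but the exact invocation and project settings on Cheyenne are assumptions.

```shell
# Hypothetical re-run of the 3-year threaded test with a 4.5-hour
# wallclock limit instead of the default, per the suggestion above.
cd cime/scripts
./create_test ERP_Ly3_P36x2.f10_f10_musgs.IHistClm50BgcCrop.cheyenne_intel.clm-cropMonthOutput \
    --wallclock 04:30:00
```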
This is a larger issue; CAM is seeing this as well, and it appears to affect all versions. I had thought the tests were hanging, but it looks like it's just abysmal performance.
Jian Sun had this to say in an exchange in cgd-cseg...
Threaded tests are failing to run in a reasonable time with the release-clm5.0 branch: ERP_P180x2_D.f19_g17.I2000Clm50SpRtmFl.derecho_intel.clm-default
Threading on Cheyenne seemed to be more comparable, for tests in ctsm5.1.dev157:
@briandobbins will look into this and get back to us.
The process placement fix does indeed solve this problem; runtimes (as of the first 50 steps) are ~10x faster. Ultimately, the fix will simply be a change to how we launch the MPI jobs, which lives in ccs_config_cesm, but the new script to do this properly for threaded jobs is still a work in progress by CISL. I'll update this issue once it's set, and will create a PR that updates CIME to a newer tag that includes the updated ccs_config_cesm version.
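For context, explicit rank/thread placement on a PBS + cray-mpich (PALS) system like Derecho might look like the fragment below. This is only a sketch of the general technique; the concrete option values and executable name are assumptions, not the actual ccs_config_cesm fix described above.

```shell
# Sketch: bind each MPI rank's OpenMP threads to their own cores on a
# 128-core Derecho node (64 ranks x 2 threads), using PALS mpiexec
# placement options. Values here are illustrative, not the real fix.
export OMP_NUM_THREADS=2

# --depth reserves OMP_NUM_THREADS cores per rank; --cpu-bind depth
# pins each rank (and its threads) to that contiguous core block.
mpiexec -n 64 --ppn 64 --depth ${OMP_NUM_THREADS} --cpu-bind depth ./cesm.exe
```

Without this kind of binding, ranks and their threads can be stacked onto the same cores, which would explain the ~10x slowdown seen only in the threaded tests.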
Brief summary of bug
Longer threaded tests seem to hang on Derecho with release-cesm2.2.04:
ERP_D_Ld10_P64x2.f10_f10_musgs.IHistClm50BgcCrop.derecho_intel.clm-ciso_decStart
ERP_Ly3_P64x2.f10_f10_musgs.IHistClm50BgcCrop.derecho_intel.clm-cropMonthOutput
ERP_P64x2_Lm25.f10_f10_musgs.I2000Clm50BgcCrop.derecho_intel.clm-monthly
ERP_P64x2_Lm36.f10_f10_musgs.I2000Clm50BgcCrop.derecho_intel.clm-clm50cropIrrigMonth_interp
ERP_P64x2_Lm7.f10_f10_musgs.I2000Clm50BgcCrop.derecho_intel.clm-irrig_alternate_monthly
ERP_P64x2_Ly3.f10_f10_musgs.I2000Clm50BgcCrop.derecho_intel.clm-irrig_o3_reduceOutput
ERS_Ly3_P64x2.f10_f10_musgs.IHistClm50BgcCropG.derecho_intel.clm-cropMonthOutput
General bug information
CTSM version you are using: release-cesm2.2.03-12-g1b4fa3602
Does this bug cause significantly incorrect results in the model's science? No?
Configurations affected: Threaded tests that run longer than about a year and a half of simulation time
Details of bug
These tests originally ran on Cheyenne with a PE layout of 36x2, which worked in release-cesm2.2.02. On Derecho they seem to hang as the run progresses and exceed the wallclock limit of 1:40.
Important output or errors that show the problem
cesm.log:
lnd.log: