Create 'coupler.res' log file in write grid comp. Explicitly specify chunk sizes in restart files #2021

Merged

Conversation

DusanJovic-NOAA
Collaborator

@DusanJovic-NOAA DusanJovic-NOAA commented Nov 30, 2023

PR Author Checklist:

  • I have linked PR's from all sub-components involved in section below.

  • I am confirming reviews are completed in ALL sub-component PR's.

  • I have run the full RT suite on either Hera/Cheyenne AND have attached the log to this PR below this line:

  • I have added the list of all failed regression tests to "Anticipated changes" section.

  • I have filled out all sections of the template.

Description

When the restart files are written by the write grid component, the log file (coupler.res) must also be written by the write grid component, to ensure that it is written only after all other restart files have been created.
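
The ordering described above can be illustrated with a minimal Python sketch. This is not the fv3atm implementation (the actual change lives in the write grid component); the function name `write_restart_set`, the file names, and the marker contents are all hypothetical and only show the "marker file last" idea.

```python
import os
import tempfile


def write_restart_set(out_dir, restart_payloads, marker_name="coupler.res", marker_text=""):
    """Write all restart files first, then the marker file last.

    Returns the file names in the order they were written.
    """
    written = []
    for name, payload in restart_payloads.items():
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before moving on
        written.append(name)

    # Only now create the marker, so its presence implies the set is complete.
    with open(os.path.join(out_dir, marker_name), "w") as f:
        f.write(marker_text)
    written.append(marker_name)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file names and contents, for illustration only.
        order = write_restart_set(tmp, {"restart_a.nc": b"a", "restart_b.nc": b"b"})
        print(order)
```

A consumer that waits for the marker file before reading can then assume the other files in the set already exist, which is the property the description asks for.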

The chunk sizes are now explicitly specified in the restart files when quilting_restart is used: Nx x Ny in the horizontal and 1 in all other dimensions.
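
As a rough illustration of that chunking rule, the following Python sketch computes one chunk size per dimension. The helper `restart_chunk_sizes`, the dimension names, and the grid sizes are assumptions made for the example; the real logic is in the Fortran write grid component.

```python
def restart_chunk_sizes(dims, x_dim="x", y_dim="y"):
    """Return one chunk size per dimension, in the order given.

    dims is a sequence of (name, size) pairs. Horizontal dimensions keep
    their full extent (Nx, Ny); every other dimension gets a chunk size of 1.
    """
    return tuple(size if name in (x_dim, y_dim) else 1 for name, size in dims)


# Hypothetical 4-D restart variable with dimensions (time, z, y, x).
dims = [("time", 1), ("z", 65), ("y", 192), ("x", 384)]
chunks = restart_chunk_sizes(dims)
print(chunks)  # (1, 1, 192, 384)
```

With this layout each chunk holds exactly one full horizontal slice. In netCDF4-python, a tuple like this could be passed as the `chunksizes` argument of `Dataset.createVariable`.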

Linked Issues and Pull Requests

Associated UFSWM Issue to close

Closes #2020

Subcomponent Pull Requests

NOAA-EMC/fv3atm/pull/726

Blocking Dependencies

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Anticipated Changes

Input data

  • No changes are expected to input data.
  • Changes are expected to input data:
    • New input data.
    • Updated input data.

Regression Tests:

  • No changes are expected to any regression test.
  • Changes are expected to the following tests:
Tests affected by changes in this PR:

Libraries

  • Not Needed
  • Needed
    • Create separate issue in JCSDA/spack-stack asking for update to library. Include library name, library version.
    • Add issue link from JCSDA/spack-stack following this item
Code Managers Log
  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.
    • N/A

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@jkbk2004
Collaborator

jkbk2004 commented Dec 4, 2023

@DusanJovic-NOAA The RRFS team requested that this PR be committed first for operational application. Can you sync up the branch?

junwang-noaa
junwang-noaa previously approved these changes Dec 4, 2023
@zach1221 zach1221 added the No Baseline Change, Ready for Commit Queue, and jenkins-ci labels and removed the Waiting for Reviews label on Dec 4, 2023
@epic-cicd-jenkins
Collaborator

Jenkins-ci ORTs passed

@BrianCurtis-NOAA
Collaborator

Acorn is sitting on its last item, which is still queued and has been for at least a few hours. There are no current fail-tests in the working directory. I plan on killing the last task on Acorn so we don't need to wait. The queue on Acorn is being used by the SAs, so I don't think this long wait is because the resources needed for this test are too high.

@zach1221
Collaborator

zach1221 commented Dec 5, 2023

@BrianCurtis-NOAA ok I think we're good to go then.

@zach1221
Collaborator

zach1221 commented Dec 5, 2023

@DusanJovic-NOAA fv3atm pr is merged. Please revert change in .gitmodules and update the submodule hash.
NOAA-EMC/fv3atm@a82381c

@BrianCurtis-NOAA
Collaborator

Did you find any consistency in the run taking longer than 30 or a history of higher run times?

@jkbk2004
Collaborator

jkbk2004 commented Dec 5, 2023

Sorry for the confusing commit remarks. Actually, the case is cpld_debug_pdlib_p8 on Jet. The baseline originated from #1967. Somehow the test log wasn't included in the report for a few PRs. But 1882 sec is not bad on Jet, given that Jet is a small machine. On Orion and Hercules, the case takes about 1500 sec. I am trying to confirm and recover the missing baseline case on Jet. At least I caught the case on Jet in the last PR and this PR, so we can let this PR move on to be merged. I will give an update in tomorrow's tag-up.

@BrianCurtis-NOAA
Collaborator

@jkbk2004 I'm confused. We have changes in this PR for cpld_debug_p8, but you mention cpld_debug_pdlib_p8 and missing baselines? Do we need the cpld_debug_p8 changes?

@jkbk2004
Collaborator

jkbk2004 commented Dec 5, 2023

@BrianCurtis-NOAA Oh, my mistake. It's a cpld_debug_pdlib_p8 issue, as the Jet log shows at the end. I am fixing it now.

@jkbk2004
Collaborator

jkbk2004 commented Dec 5, 2023

All set now.

@DeniseWorthen
Collaborator

DeniseWorthen commented Dec 5, 2023

@jkbk2004 The pdlib tests were restricted to only hera, orion, wcoss2 and cheyenne, but they got turned on everywhere in #1967. So the last two weeks' worth of pdlib tests should show up in the jet logs, but the test does not show up in the log for that PR's commit. Why is the test not running on jet?

You can see here that the test (25) is not reported:

Comparing 20210323.060000.out_grd.ww3 .........OK
0: The total amount of wall time = 1516.470165
0: The maximum resident set size (KB) = 1637252
Test 024 cpld_mpi_pdlib_p8_intel PASS
baseline dir = /mnt/lfs4/HFIP/hfv3gfs/role.epic/RT/NEMSfv3gfs/develop-20231117/control_flake_intel
working dir = /lfs4/HFIP/h-nems/Zachary.Shrader/RT_RUNDIRS/Zachary.Shrader/FV3_RT/rt_242977/control_flake_intel
Checking test 026 control_flake_intel results ....

@jkbk2004
Collaborator

jkbk2004 commented Dec 5, 2023

As I said, somehow the case was missing for the commits on Nov 22/27/29 and Dec but was reported on Dec 4 and in this PR. I am recovering it now. The cases are running and show around 1880 sec. Anyway, there is no need to hold this PR; we can merge it. Tomorrow I will report the result of recovering the missing Jet cases for the previous baseline dates.

@jkbk2004
Collaborator

jkbk2004 commented Dec 6, 2023

cpld_debug_pdlib_p8 is fully recovered on Jet for baselines develop-20231117 and develop-20231122. Timing is around 1800 to 1900 sec, so bumping up wlclk makes sense. @zach1221 it's ok to proceed to merge this PR.

@zach1221
Collaborator

zach1221 commented Dec 6, 2023

Ok, I'll need one more approval. Waiting on that.

@BrianCurtis-NOAA
Collaborator

Please wait until after the CM meeting to merge.

@FernandoAndrade-NOAA FernandoAndrade-NOAA merged commit 1f7dd77 into ufs-community:develop Dec 6, 2023
@DusanJovic-NOAA DusanJovic-NOAA deleted the rrfs_coupler_res branch December 13, 2023 19:16
Labels
  • jenkins-ci (Jenkins CI: ORT build/test on docker container)
  • No Baseline Change
  • Ready for Commit Queue (The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.)
Development

Successfully merging this pull request may close these issues.

restart file IO problems found in RRFS cycling test
9 participants