Resume nightly testing of GDASApp #1313

Closed
RussTreadon-NOAA opened this issue Oct 8, 2024 · 8 comments · Fixed by #1355 or #1377

@RussTreadon-NOAA
Contributor

Directory ci contains the following scripts

driver.sh  gw_driver.sh  hera.sh  orion.sh  run_ci.sh  run_gw_ci.sh  stable_driver.sh

along with directory validation.

stable_driver.sh was previously run via cron to

  • clone global-workflow develop
  • update jedi hashes in sorc/gdas.cd
  • build g-w with updated sorc/gdas.cd
  • run GDASApp ctests
  • if all tests Passed push updated sorc/gdas.cd to GDASApp feature/stable-nightly

The nightly cron was turned off after several failures.

This issue is opened to document the work needed to resume nightly testing of GDASApp.

@RussTreadon-NOAA
Contributor Author

Set up working copy of ci directory in my space on Hera. Turn off mail to Cory and Guillaume. Execute stable_driver.sh. Everything ran fine up to

+ ctest -R gdasapp --output-on-failure

Some of the queued tests failed to run within 1500 seconds of being submitted. Downstream dependent jobs failed.

The following tests FAILED:
        1953 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasstage_ic_202103241200 (Timeout)
        1954 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasfcst_202103241200 (Timeout)
        1955 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasprepoceanobs_202103241800 (Timeout)
        1956 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasmarinebmat_202103241800 (Timeout)
        1957 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasmarineanlinit_202103241800 (Failed)
        1958 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasmarineanlvar_202103241800 (Failed)
        1959 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasmarineanlchkpt_202103241800 (Failed)
        1960 - test_gdasapp_WCDA-3DVAR-C48mx500_gdasmarineanlfinal_202103241800 (Failed)
        1966 - test_gdasapp_atm_jjob_var_inc (Failed)
        1967 - test_gdasapp_atm_jjob_var_final (Failed)
        1973 - test_gdasapp_atm_jjob_ens_inc (Failed)
        1974 - test_gdasapp_atm_jjob_ens_final (Failed)

As a result, ctest returned a non-zero return code and the working copy of develop with updated jedi hashes was not pushed to feature/stable-nightly.

We need a more robust mechanism to run the ctests. One option is to submit all the jobs to the debug queue. A potential problem is that Hera allows a user only two debug jobs in the queue at a time. stable_driver.sh runs the ctests sequentially, so this limit is not an issue on its own. However, if the user were running other debug jobs at the same time, problems could arise.

As a test, set WORKFLOW_BUILD=OFF prior to build. ctest successfully ran the 24 non-workflow tests. The following git commands in stable_driver.sh worked

++ cat log.ctest
++ grep 'tests passed'
+ npassed='100% tests passed, 0 tests failed out of 24'
+ '[' 0 -eq 0 ']'
+ echo 'Tests:                                 *SUCCESS*'
++ date
+ echo 'Tests: Completed at Fri Oct  4 02:11:29 UTC 2024'
+ echo 'Tests: 100% tests passed, 0 tests failed out of 24'
+ echo '```'
+ exit 0
+ ci_status=0
+ total=0
+ '[' 0 -eq 0 ']'
+ cd /scratch1/NCEPDEV/da/Russ.Treadon/CI/GDASApp/stable/20241004/global-workflow/sorc/gdas.cd
+ git stash
No local changes to save
+ total=0
+ '[' 0 -ne 0 ']'
+ git checkout feature/stable-nightly
warning: unable to rmdir 'sorc/bufr-query': Directory not empty
warning: unable to rmdir 'sorc/da-utils': Directory not empty
Switched to a new branch 'feature/stable-nightly'
M       parm/jcb-algorithms
M       parm/jcb-gdas
M       sorc/fv3-jedi
M       sorc/ioda
M       sorc/iodaconv
M       sorc/jcb
M       sorc/oops
M       sorc/saber
M       sorc/soca
M       sorc/ufo
M       sorc/vader
branch 'feature/stable-nightly' set up to track 'origin/feature/stable-nightly'.
+ total=0
+ '[' 0 -ne 0 ']'

The next git command, git merge develop, failed with

+ git merge develop
Note: Fast-forwarding submodule sorc/fv3-jedi to 731fcf4cbf541f37ac0531b2504fcc4108e1f6ee
Failed to merge submodule sorc/oops (commits don't follow merge-base)
CONFLICT (submodule): Merge conflict in sorc/oops
Recursive merging with submodules currently only supports trivial cases.
Please manually handle the merging of each conflicted submodule.
This can be accomplished with the following steps:
 - go to submodule (sorc/oops), and either merge commit e6485c0a
   or update to an existing commit which has merged those changes
 - come back to superproject and run:

      git add sorc/oops

   to record the above merge or update
 - resolve any other conflicts in the superproject
 - commit the resulting index in the superproject
Automatic merge failed; fix conflicts and then commit the result.
+ total=1
+ '[' 1 -ne 0 ']'
+ echo 'Unable to merge develop'
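
The npassed line in the earlier trace shows stable_driver.sh grepping log.ctest for the 'tests passed' summary. Below is a minimal sketch of pulling the failed-test count out of that summary line, so that a push decision can rest on a number. The function name ctest_failed_count is hypothetical and is not part of stable_driver.sh; the summary format is assumed to match the line shown in the trace.

```shell
#!/bin/bash
# Hypothetical helper (not in stable_driver.sh): print the number of failed
# tests reported in a ctest log. Assumes the summary format seen in the trace:
#   "100% tests passed, 0 tests failed out of 24"
ctest_failed_count() {
  local log=$1
  local line
  # Take the first summary line; return 1 if the log has no summary at all.
  line=$(grep -m1 'tests passed' "$log") || return 1
  # Keep only the number that precedes "tests failed".
  echo "$line" | sed -E 's/.*, ([0-9]+) tests failed out of [0-9]+.*/\1/'
}
```

A caller could compare the printed count against 0 and treat a missing summary line (return code 1) as a failure in its own right.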

@RussTreadon-NOAA
Contributor Author

Rerun ci/stable_driver.sh on Hera under role.jedipara following merger of g-w PR #2978 into develop. As expected, several GDASApp ctest jobs failed

The following tests FAILED:
        1951 - test_gdasapp_fv3jedi_fv3inc (Failed)
        1963 - test_gdasapp_WCDA-3DVAR-C48mx500_gdas_marineanlvar_202103241800 (Failed)
        1964 - test_gdasapp_WCDA-3DVAR-C48mx500_gdas_marineanlchkpt_202103241800 (Failed)
        1965 - test_gdasapp_WCDA-3DVAR-C48mx500_gdas_marineanlfinal_202103241800 (Failed)
        1969 - test_gdasapp_atm_jjob_var_init (Failed)
        1970 - test_gdasapp_atm_jjob_var_run (Failed)
        1971 - test_gdasapp_atm_jjob_var_inc (Failed)
        1972 - test_gdasapp_atm_jjob_var_final (Failed)
        1974 - test_gdasapp_atm_jjob_ens_letkf (Failed)
        1976 - test_gdasapp_atm_jjob_ens_obs (Failed)
        1977 - test_gdasapp_atm_jjob_ens_sol (Failed)
        1978 - test_gdasapp_atm_jjob_ens_inc (Failed)
        1979 - test_gdasapp_atm_jjob_ens_final (Failed)
        1981 - test_gdasapp_bufr2ioda_insitu_profile_argo (Failed)
        1982 - test_gdasapp_bufr2ioda_insitu_profile_bathy (Failed)
        1983 - test_gdasapp_bufr2ioda_insitu_profile_glider (Failed)
        1984 - test_gdasapp_bufr2ioda_insitu_profile_tesac (Failed)
        1985 - test_gdasapp_bufr2ioda_insitu_profile_tropical (Failed)
        1986 - test_gdasapp_bufr2ioda_insitu_profile_xbtctd (Failed)
        1987 - test_gdasapp_bufr2ioda_insitu_surface_drifter (Failed)
        1988 - test_gdasapp_bufr2ioda_insitu_surface_trkob (Failed)

Below is a preliminary examination of the failures.

test_gdasapp_fv3jedi_fv3inc failed with an error that appears to be related to updated JEDI hashes bringing in changes from the Model Variable Renaming Sprint

fv3jedi_vc_model2geovals_mod.changevar unknown field: delp. Not in input field and no transform case specified.
fv3jedi_vc_model2geovals_mod.changevar unknown field: delp. Not in input field and no transform case specified.
fv3jedi_vc_model2geovals_mod.changevar unknown field: delp. Not in input field and no transform case specified.
fv3jedi_vc_model2geovals_mod.changevar unknown field: delp. Not in input field and no transform case specified.
fv3jedi_vc_model2geovals_mod.changevar unknown field: delp. Not in input field and no transform case specified.
Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
Abort(1) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
Abort(1) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
Abort(1) on node 4 (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
Abort(1) on node 5 (rank 5 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
fv3jedi_vc_model2geovals_mod.changevar unknown field: delp. Not in input field and no transform case specified.
slurmstepd: error: *** STEP 1720377.0 ON h22c26 CANCELLED AT 2024-10-29T13:50:50 ***

test_gdasapp_WCDA-3DVAR-C48mx500_gdas_marineanlvar_202103241800 failed with an error that appears to be related to updated JEDI hashes bringing in changes from the Model Variable Renaming Sprint

 0: OOPS_STATS IncrementalAssimilation iteration 0      - Runtime:    11.16 sec,  Local Memory:   338.18 Mb
 0: Unable to find field metadata for: sea_surface_height_above_geoid
 0: OOPS Ending   2024-10-29 14:15:32 (UTC+0000)
 2: Unable to find field metadata for: sea_surface_height_above_geoid
 4: Unable to find field metadata for: sea_surface_height_above_geoid
 6: Unable to find field metadata for: sea_surface_height_above_geoid
 8: Unable to find field metadata for: sea_surface_height_above_geoid
10: Unable to find field metadata for: sea_surface_height_above_geoid
12: Unable to find field metadata for: sea_surface_height_above_geoid
14: Unable to find field metadata for: sea_surface_height_above_geoid
 0: Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
 2: Abort(1) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
 4: Abort(1) on node 4 (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
 6: Abort(1) on node 6 (rank 6 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
 8: Abort(1) on node 8 (rank 8 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 8

test_gdasapp_atm_jjob_var_init failed due to inconsistent variables between g-w and GDASApp. GDASApp test/atm/global-workflow/config.yaml uses JCB_ALGO_YAML_VAR but g-w config.atmanl still uses JCB_ALGO_YAML.

    jcb_algo_config = parse_j2yaml(task_config.JCB_ALGO_YAML, task_config)
  File "/scratch1/NCEPDEV/da/role.jedipara/CI/GDASApp/stable/20241029/global-workflow/ush/python/wxflow/yaml_file.py", line 183, in parse_j2yaml
    raise FileNotFoundError(f"Input j2yaml file {path} does not exist!")
FileNotFoundError: Input j2yaml file @JCB_ALGO_YAML@ does not exist!
+ slurm_script[1]: postamble slurm_script 1730211510 1
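
The @JCB_ALGO_YAML@ in the traceback looks like a template placeholder that was never substituted. Below is a minimal sketch of a pre-flight check that lists such leftover tokens in a rendered config file. The helper name find_unresolved_placeholders is hypothetical, and the sketch assumes placeholders are uppercase identifiers wrapped in @ characters, as in the error message.

```shell
#!/bin/bash
# Hypothetical helper: list unresolved @NAME@ placeholders left in a file.
# Assumes placeholders are uppercase identifiers wrapped in '@'.
find_unresolved_placeholders() {
  local file=$1
  # -o prints only the matching tokens; sort -u removes duplicates.
  grep -oE '@[A-Z][A-Z0-9_]*@' "$file" | sort -u
}
```

An empty result means no placeholder of that form remains; any output names the variables that still need a value before the job is submitted.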

test_gdasapp_atm_jjob_ens_letkf and test_gdasapp_atm_jjob_ens_obs failed due to inconsistencies introduced by the updated JEDI hashes which include changes from the Model Variable Renaming Sprint

5: ABORT: FieldMetadata::getLongNameFromAnyName: Searching for a field called skin_temperature_at_surface_where_sea in the long, short and io names but not found anywhere.
5:        in file '/scratch1/NCEPDEV/da/role.jedipara/CI/GDASApp/stable/20241029/global-workflow/sorc/gdas.cd/bundle/fv3-jedi/src/fv3jedi/FieldMetadata/FieldsMetadata.cc', line 142
4: ABORT: FieldMetadata::getLongNameFromAnyName: Searching for a field called skin_temperature_at_surface_where_sea in the long, short and io names but not found anywhere.
4:        in file '/scratch1/NCEPDEV/da/role.jedipara/CI/GDASApp/stable/20241029/global-workflow/sorc/gdas.cd/bundle/fv3-jedi/src/fv3jedi/FieldMetadata/FieldsMetadata.cc', line 142
1: ABORT: FieldMetadata::getLongNameFromAnyName: Searching for a field called skin_temperature_at_surface_where_sea in the long, short and io names but not found anywhere.
1:        in file '/scratch1/NCEPDEV/da/role.jedipara/CI/GDASApp/stable/20241029/global-workflow/sorc/gdas.cd/bundle/fv3-jedi/src/fv3jedi/FieldMetadata/FieldsMetadata.cc', line 142
2: Abort(1) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
3: Abort(1) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
5: Abort(1) on node 5 (rank 5 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
1: Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

All the test_gdasapp_bufr2ioda_insitu_ jobs failed with

  File "/scratch1/NCEPDEV/da/role.jedipara/CI/GDASApp/stable/20241029/global-workflow/sorc/gdas.cd/ush/ioda/bufr2ioda/marine/b2i/b2iconverter/bufr2ioda_converter.py", line 7, in <module>
    from pyiodaconv import bufr
ModuleNotFoundError: No module named 'pyiodaconv'
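
Whatever the cause of the missing module turns out to be, a quick import check before the converters run would surface it earlier. Below is a minimal sketch; require_python_module is a hypothetical helper, and it assumes python3 on the PATH is the same interpreter the tests use.

```shell
#!/bin/bash
# Hypothetical pre-flight check: verify that a Python module can be imported.
# Assumes 'python3' on PATH is the interpreter used by the tests.
require_python_module() {
  local module=$1
  if python3 -c "import ${module}" 2>/dev/null; then
    return 0
  fi
  echo "Python module '${module}' cannot be imported; check PYTHONPATH" >&2
  return 1
}
```

Calling require_python_module pyiodaconv ahead of the bufr2ioda tests would report the problem once, instead of once per failed test.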

@RussTreadon-NOAA
Contributor Author

Update
10/29/2024 GFS v18 JEDI Transition tag up outlined strategy for moving forward on this issue.

Create branch feature/resume_nightly for development pertaining to this issue.

@RussTreadon-NOAA
Contributor Author

Issue reopened due to g-w PR #2992 being merged into g-w develop. GDASApp test_gdasapp tests which use g-w now pass.

Execute ci/stable_driver.sh on Hera as Russ.Treadon. Build and test_gdasapp successful. feature/stable-nightly updated to 47fee54. Despite this success, script stable_driver.sh generates error messages as described below.

Examine /scratch1/NCEPDEV/stmp2/Russ.Treadon/test/stable_driver.log. A non-zero return code is generated by

  # add in submodules
  git stash pop
  total=$(($total+$?))
  if [ $total -ne 0 ]; then
    echo "Unable to git stash pop" >> $stableroot/$datestr/output
  fi

Execution of the above yields

+ git stash pop
No stash entries found.
+ total=1
+ '[' 1 -ne 0 ']'
+ echo 'Unable to git stash pop'

The total=1 erroneously triggers the following Unable messages and the Problem email

+ total=1
+ caution=
+ '[' 1 -ne 0 ']'
+ echo 'Unable to commit'
+ git push --set-upstream origin feature/stable-nightly
To https://github.com/NOAA-EMC/GDASApp.git
   52c20c4..47fee54  feature/stable-nightly -> feature/stable-nightly
branch 'feature/stable-nightly' set up to track 'origin/feature/stable-nightly'.
+ total=1
+ '[' 1 -ne 0 ']'
+ echo 'Unable to push'
+ '[' 1 -ne 0 ']'
+ echo 'Issue merging with develop. please manually fix'
Issue merging with develop. please manually fix
+ [email protected]
+ SUBJECT='Problem updating feature/stable-nightly branch of GDASApp'
+ BODY=/scratch1/NCEPDEV/da/Russ.Treadon/CI/GDASApp/stable/20241119/output_stable_nightly
+ cat
+ mail -r 'Darth Vader - NOAA Affiliate <[email protected]>' -s 'Problem updating feature/stable-nightly branch of GDASApp' [email protected]

The following email was received

Problem updating feature/stable-nightly branch of GDASApp
Darth Vader - NOAA Affiliate <[email protected]>
To: [email protected]

Problem updating feature/stable-nightly branch of GDASApp. Please check /scratch1/NCEPDEV/da/Russ.Treadon/CI/GDASApp/stable/20241119/GDASApp

Despite the script indicating that problems were encountered and a Problem email being sent, branch feature/stable-nightly was actually updated to 47fee54.
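
The cascade of Unable messages comes from testing the cumulative total after every command, so one non-zero return code taints every later step. Below is a minimal sketch of one alternative, in which each message depends only on the return code of its own command while total still accumulates for the final check. The run_step helper is hypothetical and is not how stable_driver.sh is written.

```shell
#!/bin/bash
# Hypothetical helper: run one command and report only its own failure.
# 'total' still accumulates so a final check can decide whether to send mail.
total=0
run_step() {
  local label=$1
  shift
  "$@"
  local rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "Unable to $label"
    total=$((total + rc))
  fi
  return "$rc"
}

# Example usage (commands taken from the log above):
#   run_step "git stash pop" git stash pop
#   run_step "push" git push --set-upstream origin feature/stable-nightly
```

With this layout a failed git stash pop would still raise total, but a successful push would no longer log Unable to push.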

I do not know why the git stash and git stash pop lines are included in stable_driver.sh. Are they necessary?

Tagging @CoryMartin-NOAA and @DavidNew-NOAA

@CoryMartin-NOAA
Contributor

git stash and git stash pop were done to add in changes to the new branch. If I recall, you merge in a branch, and then use stash pop to update submodule hashes. No changes to stash/pop suggests that nothing changed, so then why did the hash update? Is it because the GDAS hash updated but none of the JEDI hashes updated?

@RussTreadon-NOAA
Contributor Author

47fee54 updated the following jedi hashes

sorc/fv3-jedi
sorc/ioda
sorc/iodaconv
sorc/oops
sorc/saber
sorc/soca
sorc/ufo
sorc/vader

Prior to 47fee54, commit e06890b merged develop into feature/stable-nightly

@RussTreadon-NOAA
Contributor Author

Here's the order of operations in ci/stable_driver.sh

  1. checkout g-w develop in global-workflow directory
  2. cd into sorc/gdas.cd
  3. checkout GDASApp develop
  4. execute ush/submodules/update_develop.sh
  5. execute run_gw_ci. This script builds g-w apps, executes link_workflow.sh, and executes test_gdasapp
  6. pending success (rc=0), do the following in sorc/gdas.cd
    • git stash
    • git checkout feature/stable-nightly
    • git merge develop
    • git stash pop
    • git diff-index --quiet HEAD || git commit -m "Update to new stable build on $datestr"
    • git push --set-upstream origin feature/stable-nightly
  7. search for and remove old working directories

Does git stash do anything when only submodules are updated? It seems the answer is no given the following output written to the stable_driver log file

+ cd /scratch1/NCEPDEV/da/Russ.Treadon/CI/GDASApp/stable/20241119/global-workflow/sorc/gdas.cd
+ git stash
No local changes to save
+ total=0

Given that git stash didn't do anything, it makes sense that git stash pop returns a non-zero return code: it has nothing to pop back.

+ git stash pop
No stash entries found.
+ total=1
+ '[' 1 -ne 0 ']'
+ echo 'Unable to git stash pop'

It seems stable_driver.sh worked as intended. The only issue is that the logic erroneously flagged the non-zero return code from git stash pop as an error. I don't think it was an error in this case.
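
One possible guard, sketched below, is to pop only when git stash actually saved something. The function names stash_if_dirty and pop_if_stashed and the variable stashed are hypothetical; this is an illustration of the idea, not the change made to stable_driver.sh.

```shell
#!/bin/bash
# Hypothetical guard: remember whether 'git stash' created a stash entry
# and only run 'git stash pop' when it did.
stashed=0

stash_if_dirty() {
  local before after
  before=$(git stash list | wc -l)
  git stash || return $?
  after=$(git stash list | wc -l)
  # A new entry appears only when there were local changes to save.
  if [ "$after" -gt "$before" ]; then
    stashed=1
  else
    stashed=0
  fi
}

pop_if_stashed() {
  if [ "$stashed" -eq 1 ]; then
    git stash pop
  else
    echo "Nothing was stashed; skipping git stash pop"
  fi
}
```

In the submodule-only case described above, stash_if_dirty would leave stashed=0 and pop_if_stashed would return 0, so total would stay at 0.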

@RussTreadon-NOAA
Contributor Author

11/20/2024 update

Working copy of stable_driver.sh set to run daily at 0030 UTC on Hera in user account Russ.Treadon. Updates committed to feature/stable-nightly. Will monitor behavior for several days. If script behaves as expected, commit modified stable_driver.sh to feature/stable-nightly.

Need to discuss with the larger group a procedure / process by which the updated feature/stable-nightly is merged into develop.
