
[develop] Refactor setup.py to remove use of global variables. #505

Merged

Conversation


@christinaholtNOAA christinaholtNOAA commented Dec 2, 2022

DESCRIPTION OF CHANGES:

Global variable use has been removed from setup.py and reduced in generate_FV3LAM_wflow.py. The use of globals is a carry-over from the bash era of this utility and does not meet modern coding standards.

This work is one of the preparation steps for providing dictionary configuration objects to an XML generator (still to come).
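For illustration of the direction of the refactor, here is a schematic sketch (not actual SRW code; the helper names are hypothetical): state moves out of module-level globals and into a config object that is built once and passed explicitly.

```python
# Schematic sketch: replace module-level globals with an explicit config
# dict that is built once and passed to the functions that need it.
def load_config(defaults, user_config):
    cfg = dict(defaults)      # start from defaults
    cfg.update(user_config)   # user settings override them
    return cfg

def describe_experiment(cfg):
    # Functions receive state explicitly instead of reading globals.
    return f"experiment uses grid {cfg['PREDEF_GRID_NAME']}"

cfg = load_config({"PREDEF_GRID_NAME": "RRFS_CONUS_25km"}, {})
print(describe_experiment(cfg))  # experiment uses grid RRFS_CONUS_25km
```

Passing the dictionary around also makes it straightforward to hand the same configuration object to an XML generator later.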

I will leave a review of the changes here shortly after opening the PR.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

None

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

{%- set fmtstr=" %0"~ndigits_ensmem_names~"d" -%}
{{- fmtstr%m -}}
{%- endfor %} </var>
<var name="{{ ensmem_indx_name }}">{% for m in range(1, num_ens_members+1) %}{{ "%03d " % m }}{% endfor %}</var>
Collaborator Author

Members should be named with a consistent number of zero-padded digits so that naming is predictable from experiment to experiment. I set it to 3 here so that we can support the vast majority of use cases.
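The padding in the Jinja snippet above can be sketched in plain Python (illustrative values; the width of 3 mirrors the comment):

```python
# Minimal sketch of the zero-padded member naming from the Jinja snippet above.
ndigits_ensmem_names = 3   # fixed width chosen in the comment above
num_ens_members = 5

fmtstr = f"%0{ndigits_ensmem_names}d"   # mirrors: " %0"~ndigits_ensmem_names~"d"
names = [fmtstr % m for m in range(1, num_ens_members + 1)]
print(names)  # ['001', '002', '003', '004', '005']
```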

@@ -1026,7 +1030,7 @@ model_ver="we2e""
#
# Set NCO mode OPSROOT
#
OPSROOT=\"${opsroot}\""
OPSROOT=\"${opsroot:-$OPSROOT}\""
Collaborator Author

Why not use the default when the user does not provide a value?

Collaborator

Not related to this particular change, but have you tested that specifying OPSROOT from the command line still works? That and 5 other root directories are passed through the environment on WCOSS2.

Collaborator Author

@danielabdi-noaa Thanks for bringing that to my attention. I did not notice that was a thing, and have most definitely messed it up.

I will look into that mechanism a bit more. Just to clarify, those env variables are passed at configuration time?

Collaborator

Yes, they are available during workflow generation and will override any config settings for OPSROOT, COMROOT, etc.
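The precedence described above can be sketched as follows (names and paths here are illustrative, not the actual setup.py code): environment values win over config-file values at generation time.

```python
import os

def resolve_root(name, config, default):
    """Illustrative precedence only: environment > user config > default."""
    return os.environ.get(name, config.get(name, default))

# Simulate WCOSS2 exporting OPSROOT before workflow generation:
os.environ["OPSROOT"] = "/lfs/h1/ops"
cfg = {"OPSROOT": "/path/from/config"}
print(resolve_root("OPSROOT", cfg, "/default/opsroot"))  # /lfs/h1/ops
```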

DIAG_TABLE_TMPL_FN: diag_table
FIELD_TABLE_TMPL_FN: field_table
DIAG_TABLE_TMPL_FN: diag_table.FV3_GFS_v15p2
FIELD_TABLE_TMPL_FN: field_table.FV3_GFS_v15p2
Collaborator Author

The setup.py script no longer does this for us, so we need to specify it here in full.

METPLUS_CONF: '{{ [PARMdir, "metplus"]|path_join }}'
MET_CONFIG: '{{ [PARMdir, "met"]|path_join }}'
UFS_WTHR_MDL_DIR: '{{ user.UFS_WTHR_MDL_DIR }}'

Collaborator Author

The added ability to use Jinja2 templates in the config decouples the setup.py script from the configuration for simple substitutions, path joining, math, etc. I think it also improves general readability and usability.
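For reference, `path_join` is not a built-in Jinja2 filter; here is a minimal sketch of how such a filter could be registered (an assumption for illustration — the real filter lives in the workflow tooling and may differ):

```python
import os
from jinja2 import Environment

env = Environment()
# Register a custom filter so templates can write: [a, b]|path_join
env.filters["path_join"] = lambda parts: os.path.join(*parts)

tmpl = env.from_string('{{ [PARMdir, "metplus"]|path_join }}')
print(tmpl.render(PARMdir="/home/user/parm"))  # /home/user/parm/metplus
```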

Collaborator

I love that config files have a templating option!

WRTCMP_write_groups: 1
WRTCMP_write_tasks_per_group: 20
WRTCMP_write_groups: ""
WRTCMP_write_tasks_per_group: ""
Collaborator Author

If we set these values here, and also provide LAYOUT_X and LAYOUT_Y from the user config, PE_MEMBER01 is computed before it should be. I thought it would be reasonable to treat them just like all the other WRTCMP variables below.
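For context, PE_MEMBER01 is roughly the compute-layout tasks plus the write-component tasks. This simplified sketch (the real setup.py accounts for more settings) shows why pre-set WRTCMP defaults would change the early computation:

```python
def pe_member01(layout_x, layout_y, write_groups, write_tasks_per_group):
    # Simplified: compute tasks plus write-component tasks for one member.
    compute = layout_x * layout_y
    write = (write_groups or 0) * (write_tasks_per_group or 0)
    return compute + write

# With defaults left in place, the value includes write tasks too early:
print(pe_member01(5, 2, 1, 20))   # 30
# With empty placeholders, only the compute layout contributes:
print(pe_member01(5, 2, "", ""))  # 10
```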

params_dict["LAYOUT_Y"] = LAYOUT_Y
if BLOCKSIZE is not None:
    params_dict["BLOCKSIZE"] = BLOCKSIZE

Collaborator Author

Handled in setup.py

RUN_TASK_MAKE_GRID = True
RUN_TASK_MAKE_OROG = True
RUN_TASK_MAKE_SFC_CLIMO = True

Collaborator Author

The logic for this has moved up to line 444, and has also been changed to raise an exception to tell the user that there's no way to get the files, instead of assuming this is what should be done.
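A hedged sketch of the pattern described (the actual condition and message in setup.py differ; names here are illustrative):

```python
def check_pregen_files(task_enabled, pregen_dir):
    """Illustrative only: raise instead of silently enabling the task
    when there is no way to obtain the pre-generated files."""
    if not task_enabled and not pregen_dir:
        raise FileNotFoundError(
            "Pre-generated files were not provided and the task that "
            "creates them is disabled; there is no way to get the files."
        )

check_pregen_files(task_enabled=True, pregen_dir=None)  # fine: task will make them
try:
    check_pregen_files(task_enabled=False, pregen_dir=None)
except FileNotFoundError as err:
    print(f"FATAL: {err}")
```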

"NY": NY,
"NHW": NHW,
"STRETCH_FAC": STRETCH_FAC,
}
Collaborator Author

This use case should never be realized. If it is, there should be an error. That logic is now near line 602.

#
# -----------------------------------------------------------------------
#
settings = {
Collaborator Author

All of these were set in the expt_config already.

for sect, sect_keys in expt_config.items():
    for k, v in sect_keys.items():
        expt_config[sect][k] = str_to_list(v)
extend_yaml(expt_config)
Collaborator Author

After all the validation and value updating, iterate through the configuration, filling in any remaining Jinja templates that may exist.
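That final pass can be sketched as follows (a hypothetical helper; the real extend_yaml handles nesting and YAML-specific details):

```python
from jinja2 import Template

def fill_templates(config):
    """Sketch only: render remaining Jinja-templated string values in
    place, using the flattened config as the rendering context."""
    context = {k: v for section in config.values() for k, v in section.items()}
    for section in config.values():
        for key, value in section.items():
            if isinstance(value, str) and "{{" in value:
                section[key] = Template(value).render(**context)

cfg = {"user": {"PARMdir": "/parm"},
       "verification": {"MET_CONFIG": '{{ PARMdir }}/met'}}
fill_templates(cfg)
print(cfg["verification"]["MET_CONFIG"])  # /parm/met
```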

@danielabdi-noaa
Collaborator

@christinaholtNOAA This is great work! Especially, I like the Jinja templating of config files, which should bring a lot of power and flexibility to how we set up config files.

@MichaelLueken MichaelLueken changed the title Refactor setup.py to remove use of global variables. [develop] Refactor setup.py to remove use of global variables. Dec 5, 2022
Remove print statements from extend_yaml.
Collaborator

@danielabdi-noaa danielabdi-noaa left a comment

@christinaholtNOAA Thanks for addressing my suggestions! Looks good to me.

Collaborator

@danielabdi-noaa danielabdi-noaa left a comment

Minor merge issues.

ush/config_defaults.yaml Outdated Show resolved Hide resolved
ush/config_defaults.yaml Outdated Show resolved Hide resolved
@christinaholtNOAA
Collaborator Author

Okay, I was able to successfully complete all the fundamental tests on Hera yesterday after my changes and do not expect that this morning's changes will make a difference. I can kick off the jet tests to see where this PR stands.

@christinaholtNOAA christinaholtNOAA added the ci-jet-intel-WE Kicks off automated workflow test on jet with intel label Dec 13, 2022
@MichaelLueken
Collaborator

@christinaholtNOAA Great to hear! Please try to run them on Jet; however, I think that Jet might be going down shortly for maintenance.

@christinaholtNOAA christinaholtNOAA added the ci-hera-intel-WE Kicks off automated workflow test on hera with intel label Dec 13, 2022
@venitahagerty venitahagerty removed the ci-hera-intel-WE Kicks off automated workflow test on hera with intel label Dec 13, 2022
@venitahagerty
Collaborator

venitahagerty commented Dec 13, 2022

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1143614160/20221213153512/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 10 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Succeeded on hera: community_ensemble_2mems_stoch
Experiment Succeeded on hera: pregen_grid_orog_sfc_climo
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Failed on hera: MET_ensemble_verification
2022-12-13 16:28:07 +0000 :: hfe10 :: Task run_pointstatvx_mem001, jobid=38970055, in state DEAD (FAILED), ran for 21.0 seconds, exit status=256, try=1 (of 1)
All experiments completed

Collaborator

@MichaelLueken MichaelLueken left a comment

@christinaholtNOAA I have manually submitted the fundamental tests on Hera and Cheyenne. All tests have successfully run on these machines (no fatal errors were encountered). My concerns have also been addressed, so I will now approve this work. The Jenkins tests will be submitted first thing tomorrow morning, once Jet is back and (hopefully) the Gaea queue has been cleared.

@venitahagerty venitahagerty removed the ci-jet-intel-WE Kicks off automated workflow test on jet with intel label Dec 14, 2022
@venitahagerty
Collaborator

venitahagerty commented Dec 14, 2022

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1143614160/20221214005019/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 10 experiments
If test failed, please make changes and add the following label back:
ci-jet-intel-WE
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on jet: specify_DT_ATMOS_LAYOUT_XY_BLOCKSIZE
Experiment Succeeded on jet: custom_ESGgrid
Experiment Succeeded on jet: custom_GFDLgrid
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on jet: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: specify_RESTART_INTERVAL
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: specify_DOT_OR_USCORE
All experiments completed

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Dec 14, 2022
@MichaelLueken
Collaborator

@christinaholtNOAA While running the standard 9 fundamental tests on Gaea, I encountered a new fatal error -

nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR encountered the following fatal error:

  File "/lustre/f2/pdata/ncep/Michael.Lueken/ufs-srweather-app/ush/setup.py", line 1099, in setup
    res_in_fns = link_fix(
  File "/lustre/f2/pdata/ncep/Michael.Lueken/ufs-srweather-app/ush/link_fix.py", line 317, in link_fix
    create_symlink_to_file(fp, fn, relative_link_flag)
  File "/lustre/f2/pdata/ncep/Michael.Lueken/ufs-srweather-app/ush/python_utils/create_symlink_to_file.py", line 40, in create_symlink_to_file
    print_err_msg_exit(
  File "/lustre/f2/pdata/ncep/Michael.Lueken/ufs-srweather-app/ush/python_utils/print_msg.py", line 20, in print_err_msg_exit
    traceback.print_stack(file=sys.stderr)
FATAL ERROR: 
Cannot create symlink to specified target file because the latter does
not exist or is not a file:
    target = 'RRFS_CONUScompact_25km/C_mosaic.halo6.nc'
Exiting with nonzero status.

I ran a test using develop and the test was submitted without issue. Please take a look at what might be happening here. Thanks!

@MichaelLueken MichaelLueken added the DO_NOT_MERGE Ensure that a PR isn't merged label Dec 14, 2022
@MichaelLueken
Collaborator

Additional details on the Gaea fatal error:

The file in question should be:

fix/fix_lam/C403_mosaic.halo6.nc

I'm not sure why it is looking for C_mosaic.halo6.nc

Is this due to the following warning message seen?

GRID_DIR not specified!
Setting GRID_DIR = RRFS_CONUScompact_25km

@christinaholtNOAA
Collaborator Author

I have definitely found room for improvement in terms of logging messages as I try to debug this one.

I am seeing that there is a bit of a gap in the gaea machine file. It does not specify DOMAIN_PREGEN_BASEDIR, only TEST_DOMAIN_PREGEN_BASEDIR. The run_WE2E_tests.sh script fills in GRID_DIR, OROG_DIR, and SFC_CLIMO_DIR with a derivative of TEST_DOMAIN_PREGEN_BASEDIR when it runs.

I will update the logic to match this behavior, but does it make sense to set these three in the test script and NOT set DOMAIN_PREGEN_BASEDIR?
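The fallback behavior described above can be sketched as follows (the path layout and helper name are assumptions for illustration, not the actual run_WE2E_tests.sh logic):

```python
import os

def derive_pregen_dirs(test_pregen_basedir, grid_name):
    # Illustrative: point GRID_DIR, OROG_DIR, and SFC_CLIMO_DIR at a
    # directory derived from TEST_DOMAIN_PREGEN_BASEDIR.
    base = os.path.join(test_pregen_basedir, grid_name)
    return {"GRID_DIR": base, "OROG_DIR": base, "SFC_CLIMO_DIR": base}

dirs = derive_pregen_dirs("/path/to/FV3LAM_pregen", "RRFS_CONUScompact_25km")
print(dirs["GRID_DIR"])  # /path/to/FV3LAM_pregen/RRFS_CONUScompact_25km
```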

Change logic for handling grid_dir, etc. when specified separate from
pregen_basedir.
@MichaelLueken
Collaborator

@christinaholtNOAA I'm certainly open to this, but since the machine files are still necessary, and the locations for the DOMAIN_PREGEN_BASEDIR are different across the machines, it makes sense to keep this in the machine files, at least for now.

@christinaholtNOAA
Collaborator Author

Since I don't have Gaea access today, I ran this test again on Hera, but without DOMAIN_PREGEN_BASEDIR defined in the machine file, and it passed. Fingers crossed that means this will work on Gaea.

@christinaholtNOAA
Collaborator Author

@MichaelLueken I'm sorry, my suggestion/question wasn't clear!

I definitely think we need to keep the machine files. I also think it's important to have all the machine files that have staged data include full information about that data. For example, as it currently stands, a user running on Gaea could not run their own NCO configuration easily without setting either DOMAIN_PREGEN_BASEDIR or all of the GRID_DIR, OROG_DIR, and SFC_CLIMO_DIR variables manually.

I think that there are 2 ways to approach it (both could be applied):

  • Add DOMAIN_PREGEN_BASEDIR to the gaea machine file.
  • In the test suite, set DOMAIN_PREGEN_BASEDIR for the platform in the user config (prepared by run_WE2E_tests.sh) when TEST_DOMAIN_PREGEN_BASEDIR is provided in a machine file.

One downside: had those two things been done, we wouldn't have had a test that caught that I removed the logic for using a user-specified GRID_DIR rather than a default PREGEN_BASEDIR.

@MichaelLueken
Collaborator

@christinaholtNOAA Ah, I see. I agree, adding DOMAIN_PREGEN_BASEDIR to the gaea machine file and then setting DOMAIN_PREGEN_BASEDIR to TEST_DOMAIN_PREGEN_BASEDIR if it has been supplied by the machine file sounds great! Will you be opening a new PR for this, or adding it to this PR?

Also, I have submitted the nco test on Gaea and it has successfully passed. No more fatal errors! I'll go ahead and remove the DO_NOT_MERGE label now. I'm done for the day, so you can go ahead and merge this work if you are ready.

Thanks again for working with me through the issues!

@MichaelLueken MichaelLueken removed the DO_NOT_MERGE Ensure that a PR isn't merged label Dec 16, 2022
@christinaholtNOAA
Collaborator Author

@MichaelLueken I'll open a 2nd PR for gaea. I'll go ahead and merge this one now.
