Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AQM configuration files do not work after YAML change PR #676 #709

Closed
chan-hoo opened this issue Apr 2, 2023 · 16 comments · Fixed by #722
Closed

AQM configuration files do not work after YAML change PR #676 #709

chan-hoo opened this issue Apr 2, 2023 · 16 comments · Fixed by #722
Assignees
Labels
bug Something isn't working Priority: HIGH

Comments

@chan-hoo
Copy link
Collaborator

chan-hoo commented Apr 2, 2023

Expected behavior

  • The AQM configuration files config.aqm.community.yaml and config.aqm.nco.realtime.yaml generate XML files.

Current behavior

  • config.aqm.community.yaml: XML file is generated, but the workflow dependencies are not set correctly.
  • config.aqm.nco.realtime.yaml: XML file is not generated.

Steps To Reproduce

  1. cp config.aqm.nco.realtime.yaml config.yaml
  2. python3 generated_FV3LAM_wflow.py
@christinaholtNOAA
Copy link
Collaborator

christinaholtNOAA commented Apr 2, 2023

@chan-hoo I'm sorry for the inconvenience. I've prepared the fix_aqm branch in my repo with changes that I think should fix AQM in Community and NCO mode. I got an XML from the community config that appeared to do what is expected. I am unable to get an XML for the NCO mode as I have no information about how to test it as my own user on Hera.

I can either open a PR to develop, or to your current working branch. The latter may be better so that you can test the changes with your others to ensure they work.

Let me know what you'd like me to do with the branch.

We should also discuss how SRW can more thoroughly test the features you need to protect for AQM.

EDIT: You're also welcome to merge my branch into yours without me opening a PR, if you'd like.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Apr 4, 2023

@christinaholtNOAA, thank you for your quick fix! I agree with you. I'll test your branch and merge it to my PR.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Apr 4, 2023

@christinaholtNOAA, the following parent.nnodes and parent.ppn are not updated on wcoss2:

    <task name="run_fcst_mem#mem#" cycledefs="forecast" maxtries="1" >
      <account>&ACCOUNT;</account>
      <native>-l place=excl</native>
      <nodes>{{ parent.nnodes }}:ppn={{ parent.ppn }}:tpp=1</nodes>

How can I fix this?

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Apr 4, 2023

@christinaholtNOAA, when I remove parent. from them in ush/machine/wcoss2.yaml, it seems to work correctly. Is this the right solution?

@chan-hoo chan-hoo self-assigned this Apr 4, 2023
@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Apr 4, 2023

@christinaholtNOAA, in the workflow of AQM, the task task_aqm_ics_ext is activated only when COLDSTART = FALSE. How can this be controlled in the workflow yaml file? I tried adding {%- if not workflow.COLDSTART %} but it failed.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Apr 4, 2023

@christinaholtNOAA, the following dependency should be added to the workflow of AQM:

{%- if not coldstart %}
. Can you let me know how to do this?

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Apr 6, 2023

@christinaholtNOAA, as mentioned above, the dependencies for AQM are not set correctly in the rocoto XML file. Please take a look at the above issues as early as possible.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Apr 6, 2023

@christinaholtNOAA, the cycledefs cycled is missing in the workflow interface:

cycledefs: cycled

Where can I define it?

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Apr 6, 2023

@christinaholtNOAA, can you explain why you updated FCST_LEN_HRS in ush/setup.py as follows?

workflow_config["FCST_LEN_HRS"] = min(fcst_len_cycl)

It makes all the scripts for AQM work incorrectly because it mislead the following statement (for example, in exregional_nexus_gfs_sfc.sh:
if [ "${FCST_LEN_HRS}" = "-1" ]; then

Didn't you check the AQM part at all when you made such a huge change in PR #676? Your PR is causing so many issues in AQM.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Apr 6, 2023

@MichaelLueken, can you coordinate this issue with @christinaholtNOAA? She doesn't respond for a couple of days.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Apr 6, 2023

@christinaholtNOAA, I think that the following change only works when {cyc} starts from 00:

CYCLE_IDX=$(( ${cyc} / ${INCR_CYCL_FREQ} ))

What if cyc=[6, 18] or cyc=[6, 12, 18] for one day long?

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Apr 7, 2023

@christinaholtNOAA, can you explain why you updated FCST_LEN_HRS in ush/setup.py as follows?

workflow_config["FCST_LEN_HRS"] = min(fcst_len_cycl)

It makes all the scripts for AQM work incorrectly because it mislead the following statement (for example, in exregional_nexus_gfs_sfc.sh:

if [ "${FCST_LEN_HRS}" = "-1" ]; then

Didn't you check the AQM part at all when you made such a huge change in PR #676? Your PR is causing so many issues in AQM.

I introduce NUM_FCST_LEN_CYCL to resolve this issue:
https://github.com/chan-hoo/ufs-srweather-app/blob/05db0c2721d0c373b84aabedb7785369f6651a48/scripts/exregional_nexus_gfs_sfc.sh#L67

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Apr 7, 2023

@christinaholtNOAA, I think that the following change only works when {cyc} starts from 00:

CYCLE_IDX=$(( ${cyc} / ${INCR_CYCL_FREQ} ))

What if cyc=[6, 18] or cyc=[6, 12, 18] for one day long?

The first cycle should be subtracted from the current cycle to resolve this:
https://github.com/chan-hoo/ufs-srweather-app/blob/05db0c2721d0c373b84aabedb7785369f6651a48/scripts/exregional_nexus_gfs_sfc.sh#L68

@christinaholtNOAA
Copy link
Collaborator

@chan-hoo GitHub messages to my inbox are not the best way to get in touch with me, especially this week while we've had a NOAA-GSL retreat. Please feel free to email or Slack me directly to get my full attention on these matters. I am sorry for my delayed response, and for the stress this has caused.

In general, if you'd like to turn on/off a task, you can include it in the YAML file with no entry. So, if you've set COLDSTART=True in the YAML and other tasks need to be turned off, for now, you can enter the tasks that should not be activated like this:

rocoto:
  tasks:
    task_aqm_ics_ext:

I've added that to the config.aqm.nco.realtime.yaml in the branch I shared over the weekend.

Cycledefs may be defined or added by adding their definition to the rocoto section of the config YAML:

rocoto:  
  cycledefs:
    cycled:
      -  !startstopfreq ['{{workflow.DATE_FIRST_CYCL}}', '{{workflow.DATE_LAST_CYCL}}', '{{workflow.INCR_CYCL_FREQ}}']
    at_start:

If you want to remove the default ones, just include them above with no argument, and they will be excluded. I showed an example for the at_start cycle above.

I've pushed these two changes to the the aqm_fix branch. I'll continue working through your other comments now.

@christinaholtNOAA
Copy link
Collaborator

@chan-hoo I definitely overlooked the change to FCST_LEN_HRS in setup.py causing a problem in the run scripts. I will push a fix shortly, and update you here.

@christinaholtNOAA
Copy link
Collaborator

I pushed the change for the variable forecast length issue. It was quite similar to yours, but using bash to glean the same information from the existing configurable information instead of storing a new one.


I'm not sure that I understand the change for not starting at 00Z. Perhaps it is best if we require the user to stick to defining FCST_LEN_CYCL relative to 00Z. That should work even if the first fcst is for 18 Z.

  DATE_FIRST_CYCL: "2023031518"
  DATE_LAST_CYCL: "2023031600"
  INCR_CYCL_FREQ: 6
  FCST_LEN_CYCL:
    - 6
    - 12
    - 12
    - 6

This should result in a 6 hr forecast at both the cycles. Otherwise, we're going to run into run-time issues like you described. I think I was allowing this option to try to do the same thing:

  DATE_FIRST_CYCL: "2023031518"
  DATE_LAST_CYCL: "2023031600"
  INCR_CYCL_FREQ: 6
  FCST_LEN_CYCL:
    - 6
    - 6

And like you pointed out, that doesn't work because there is no 18 / 6 = 3 index in that array. We could also consider just padding the array with 4 entries in setup.py if the user doesn't provide it. I worry that we're going to over-engineer a solution that allows so much flexibility we can't thoroughly test for the outcomes.


Speaking of testing...we are in desperate need of AQM tests that actually protect the capabilities. There are a couple of problems:

  1. We don't have any tests that automatically check that we haven't broken one of our most important workflow configurations.
  2. The tests are not actually testing what they need to. For example, all of this might have run all the tasks, but do we know that we've gotten variable forecast lengths?
  3. We have no way to test the NCO config. It's a huge resource, and we have no guidance on how to set variables if we aren't on WCOSS2. So either we need automated Jenkins pipelines for those tests, or a similar configuration on a platform we already have Jenkins on.

Without these tests, we will always be breaking AQM. If it's not tested, it's broken. There's no way around it.

This is a very big problem for AQM and will be for RRFS and needs immediate attention. I will bring this up in the meeting later this morning, and will add it as an Issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Priority: HIGH
Projects
None yet
2 participants