Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Fixed issue #649 and tested on cheyenne. #650

Merged
merged 3 commits into from
Mar 10, 2023
Merged

[develop] Fixed issue #649 and tested on cheyenne. #650

merged 3 commits into from
Mar 10, 2023

Conversation

padhrigmccarthy
Copy link
Contributor

DESCRIPTION OF CHANGES:

Please review this one-line change, that I believe resolves issue #649. As far as I can tell, the problem was a typo that places the '-n ${NUMTS}' argument before the gefs2lbc_para executable instead of after. This causes mpirun to fail on cheyenne because it's an invalid mpirun argument.

Type of change

  • [ x] Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

Now runs on cheyenne with RUN_TASK_AQM_LBCS and DO_AQM_GEFS_LBCS both set to true. I have extensive local configuration changes that allow the overall workflow to run on cheyenne. I have no access to Hera, Orion, WCOSS3, etc.

ISSUE:

#649

Copy link
Collaborator

@chan-hoo chan-hoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@padhrigmccarthy, with this change, 1) the run time of this task aqm_lbcs increased significantly on WCOSS2. (before: <1min, after: >15mins) 2) the task failed on Hera with the following error: [h24c53:300358:0:300358] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[h24c53:300358:0:300358] ib_mlx5_log.c:139 RC QP 0x86f9 wqe[0]: SEND --e [va 0x2b609edef280 len 34 lkey 0x3eff03]
[h24c53:300357:0:300357] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[h24c53:300357:0:300357] ib_mlx5_log.c:139 RC QP 0x86eb wqe[0]: SEND --e [va 0x2b2d3dbed200 len 34 lkey 0x1cba0f]

Copy link
Collaborator

@chan-hoo chan-hoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@padhrigmccarthy, the run on wcoss2 seems to be stuck. It doesn't go forward either.

@chan-hoo
Copy link
Collaborator

chan-hoo commented Mar 7, 2023

@ytangnoaa is the original developer of the code and script. @ytangnoaa, do you have any idea to fix this issue?

@padhrigmccarthy
Copy link
Contributor Author

@chan-hoo Thank you for testing on Hera and WCOSS2! I don't fully understand how this runs without this change on those platforms, though they must use something other than mpirun for RUN_CMD_UTILS. In any case, your findings indicate that the change I've proposed needs to be a local mod when running online-cmaq on cheyenne.

I'm open to suggestions on how to accomplish this without having each cheyenne user edit exregional_aqm_lbcs.sh before using RUN_TASK_AQM_LBCS and DO_AQM_GEFS_LBCS. Thank you again!

@chan-hoo
Copy link
Collaborator

chan-hoo commented Mar 7, 2023

@padhrigmccarthy, I think you should change the machine files ush/machine/[machine].yaml and the command in the script.

  1. scripts/exregional_aqm_lbcs.sh: ${RUN_CMD_UTILS} -n ${NUMTS} ---> ${RUN_CMD_AQMLBC}
  2. add the following line between Ln 19-20 in ush/machine/hera.yaml: RUN_CMD_AQMLBC: srun --export=ALL -np $nprocs -n ${NUMTS}
  3. add the following line between Ln 19-20 in ush/machine/wcoss2.yaml: RUN_CMD_AQMLBC: mpirun -n ${nprocs} -n ${NUMTS}
  4. add the following line between Ln 19-20 in ush/machine/cheyenne.yaml: RUN_CMD_AQMLBC: put the command here

@chan-hoo
Copy link
Collaborator

chan-hoo commented Mar 7, 2023

@ytangnoaa, I got a question when I wrote the above comment for wcoss2.yaml. Is the command mpirun -n ${nprocs} -n ${NUMTS} correct?? The -n flag is repeated. What do you think about it?

@ytangnoaa
Copy link

@ytangnoaa, I got a question when I wrote the above comment for wcoss2.yaml. Is the command mpirun -n ${nprocs} -n ${NUMTS} correct?? The -n flag is repeated. What do you think about it?

You are right. We should only keep '-n ${NUMTS}'

@padhrigmccarthy
Copy link
Contributor Author

padhrigmccarthy commented Mar 7, 2023 via email

mpirun -np ${NUMTS} on cheyenne
mpiexec -n ${NUMTS} on wcoss2
srun --export=ALL -n ${NUMTS} on hera
Broke these out into RUN_CMD_AQMLBC, defined in the machine/machine.yaml files.
@padhrigmccarthy
Copy link
Contributor Author

@chan-hoo When I follow your suggestion, I get the following error when running on cheyenne. It seems that adding RUN_AQMLBC to ush/machine/cheyenne.yaml is not enough to define the variable. Does it also need to be added to config_defaults.yaml?
log/aqm_lbcs_2020081700.log:

  • PREP_STEP
  • :
    /glade/work/paddy/online-cmaq/ufs-srweather-app/scripts/exregional_aqm_lbcs.sh: line 209: RUN_CMD_AQMLBC: unbound variable
    FATAL ERROR:

@chan-hoo
Copy link
Collaborator

chan-hoo commented Mar 8, 2023

@padhrigmccarthy, you should define a new parameter in ush/config_defaults.yaml: add it to Ln 201~213 as follows:

  # RUN_CMD_AQM:
  # The run command for some AQM tasks.
  #
  # RUN_CMD_AQMLBC:
  # The run command for AQM_LBCS task.
  #
  #-----------------------------------------------------------------------
  #   
  RUN_CMD_SERIAL: ""
  RUN_CMD_UTILS: ""
  RUN_CMD_FCST: ""
  RUN_CMD_POST: "" 
  RUN_CMD_AQM: "" 
  RUN_CMD_AQMLBC: ""

@chan-hoo
Copy link
Collaborator

chan-hoo commented Mar 8, 2023

@padhrigmccarthy, in addition, can you add RUN_CMD_AQMLBC to ush/machine/orion.yaml too? It should be the same as that of Hera:

RUN_CMD_AQMLBC: srun --export=ALL -n ${NUMTS}

ush/machine/hera.yaml Outdated Show resolved Hide resolved
@chan-hoo
Copy link
Collaborator

chan-hoo commented Mar 8, 2023

@padhrigmccarthy, I confirm that your change works well on Hera as well as WCOSS2. Once you confirm it works correctly on Cheyenne, I'll approve this PR.

@padhrigmccarthy
Copy link
Contributor Author

@padhrigmccarthy, I confirm that your change works well on Hera as well as WCOSS2. Once you confirm it works correctly on Cheyenne, I'll approve this PR.

@chan-hoo The small changes I just pushed run successfully on Cheyenne. Thank you for all of your help!

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Mar 8, 2023
Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@padhrigmccarthy The Jenkins tests passed for all machines, with the exception of Orion, which is a known issue. Manual testing of the WE2E tests on Orion shows that all tests successfully pass. Since these changes look good to me, I will now approve of these changes!

@MichaelLueken MichaelLueken merged commit 86c3eb0 into ufs-community:develop Mar 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AQM exregional_aqm_lbcs.sh specifies '-n ${NUMTS}' before gefs2lbc_para executable
4 participants