[develop] Fixed issue #649 and tested on cheyenne. #650
@padhrigmccarthy, with this change, 1) the run time of this task aqm_lbcs increased significantly on WCOSS2. (before: <1min, after: >15mins) 2) the task failed on Hera with the following error:
[h24c53:300358:0:300358] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[h24c53:300358:0:300358] ib_mlx5_log.c:139 RC QP 0x86f9 wqe[0]: SEND --e [va 0x2b609edef280 len 34 lkey 0x3eff03]
[h24c53:300357:0:300357] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[h24c53:300357:0:300357] ib_mlx5_log.c:139 RC QP 0x86eb wqe[0]: SEND --e [va 0x2b2d3dbed200 len 34 lkey 0x1cba0f]
@padhrigmccarthy, the run on wcoss2 seems to be stuck. It doesn't go forward either.
@ytangnoaa is the original developer of the code and script. @ytangnoaa, do you have any idea how to fix this issue?
@chan-hoo Thank you for testing on Hera and WCOSS2! I don't fully understand how this runs without this change on those platforms, though they must use something other than mpirun for RUN_CMD_UTILS. In any case, your findings indicate that the change I've proposed needs to be a local mod when running online-cmaq on cheyenne. I'm open to suggestions on how to accomplish this without having each cheyenne user edit exregional_aqm_lbcs.sh before using RUN_TASK_AQM_LBCS and DO_AQM_GEFS_LBCS. Thank you again!
@padhrigmccarthy, I think you should change the machine files
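@ytangnoaa, I got a question when I wrote the above comment for wcoss2.yaml. Is the command mpirun -n ${nprocs} -n ${NUMTS} correct?? The -n flag is repeated. What do you think about it?
You are right. We should only keep '-n ${NUMTS}'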
I am about to suggest changes that clean up a few details in Chan-Hoo's
suggestion. The issue is that mpirun (on cheyenne) takes -np, whereas the
launchers on the other hosts take a single -n flag.
On Tue, Mar 7, 2023 at 2:46 PM Youhua Tang wrote:
> @ytangnoaa, I got a question when I wrote the above comment for wcoss2.yaml. Is the command mpirun -n ${nprocs} -n ${NUMTS} correct?? The -n flag is repeated. What do you think about it?
You are right. We should only keep '-n ${NUMTS}'
mpirun -np ${NUMTS} on cheyenne
mpiexec -n ${NUMTS} on wcoss2
srun --export=ALL -n ${NUMTS} on hera
Broke these out into RUN_CMD_AQMLBC, defined in the machine/machine.yaml files.
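The per-machine launcher prefixes listed in this commit can be pictured as a simple lookup. The sketch below is only an illustration: `select_run_cmd` is a hypothetical helper, not part of the workflow, and in the PR itself these strings are defined in the machine YAML files rather than in a shell function.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): returns the launcher prefix
# quoted in this thread for a given machine name and task count.
select_run_cmd() {
  local machine="$1"
  local numts="$2"
  case "${machine}" in
    cheyenne) echo "mpirun -np ${numts}" ;;            # mpirun on cheyenne takes -np
    wcoss2)   echo "mpiexec -n ${numts}" ;;            # single -n flag
    hera)     echo "srun --export=ALL -n ${numts}" ;;  # single -n flag
    *)        echo "unknown machine: ${machine}" >&2; return 1 ;;
  esac
}

# Example: print the prefix that would precede the executable on cheyenne.
select_run_cmd cheyenne 4
```

Keeping the whole prefix, including the task-count flag, in one per-machine setting avoids the repeated `-n` that was questioned earlier in the thread.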
@chan-hoo When I follow your suggestion, I get the following error when running on cheyenne. It seems that adding RUN_AQMLBC to ush/machine/cheyenne.yaml is not enough to define the variable. Does it also need to be added to config_defaults.yaml?
@padhrigmccarthy, you should define a new parameter in
@padhrigmccarthy, in addition, can you add
@padhrigmccarthy, I confirm that your change works well on Hera as well as WCOSS2. Once you confirm it works correctly on Cheyenne, I'll approve this PR.
@chan-hoo The small changes I just pushed run successfully on Cheyenne. Thank you for all of your help!
@padhrigmccarthy The Jenkins tests passed for all machines, with the exception of Orion, which is a known issue. Manual testing of the WE2E tests on Orion shows that all tests successfully pass. Since these changes look good to me, I will now approve of these changes!
DESCRIPTION OF CHANGES:
Please review this one-line change, which I believe resolves issue #649. As far as I can tell, the problem was a typo that places the '-n ${NUMTS}' argument before the gefs2lbc_para executable instead of after it. This causes mpirun to fail on cheyenne because '-n' is not a valid mpirun argument there.
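To make the two orderings concrete, here is a dry-run sketch that only builds and echoes the command lines; nothing is launched. The launcher string, executable path, and NUMTS value are placeholders chosen for illustration, not the values used by the workflow.

```shell
# Placeholders for illustration only (assumed, not taken from the scripts).
launcher="mpirun"
exe="./gefs2lbc_para"
NUMTS=4

# Ordering described as the typo: the flag precedes the executable,
# so the launcher itself has to parse "-n".
cmd_before="${launcher} -n ${NUMTS} ${exe}"

# Ordering proposed in this description: the flag follows the executable,
# so it is handed to the program as an ordinary argument.
cmd_after="${launcher} ${exe} -n ${NUMTS}"

echo "${cmd_before}"
echo "${cmd_after}"
```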
Type of change
TESTS CONDUCTED:
Now runs on cheyenne with RUN_TASK_AQM_LBCS and DO_AQM_GEFS_LBCS both set to true. I have extensive local configuration changes that allow the overall workflow to run on cheyenne. I have no access to Hera, Orion, WCOSS2, etc.
ISSUE:
#649