Unreachable hosts not reset on platform group failure. #6100

Closed
wxtim opened this issue May 14, 2024 · 2 comments · Fixed by #6109


@wxtim
Member

wxtim commented May 14, 2024

Description

When all hosts in a platform are uncontactable, we remove all the hosts on that platform from the list of bad hosts to allow submission retries.

This doesn't appear to happen when all the platforms in a group are exhausted, so submit retries are ineffective at handling short-term network blips.
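
To make the symptom concrete, here is a toy model of it. This is not Cylc's actual code: `select_host`, `submit`, `bad_hosts` and `reachable` are hypothetical names chosen for illustration. It only shows that if nothing clears the bad-hosts set, a retry made after the network has recovered still finds no usable host.

```python
# Toy model only; none of these names come from Cylc itself.
PLATFORM_GROUP = {"mygroup": {"myplatform": ["mymachine"]}}


def select_host(group, bad_hosts):
    """Return the first host in the group not marked bad, else None."""
    for hosts in PLATFORM_GROUP[group].values():
        for host in hosts:
            if host not in bad_hosts:
                return host
    return None


def submit(group, bad_hosts, reachable):
    """Try one submission; mark the chosen host bad if it is unreachable."""
    host = select_host(group, bad_hosts)
    if host is None:
        return "submit-failed (no hosts left)"
    if host not in reachable:
        bad_hosts.add(host)
        return "submit-failed (unreachable)"
    return "submitted"


bad_hosts = set()
# First attempt happens during the network blip.
first = submit("mygroup", bad_hosts, reachable=set())
# The retry happens after the host is reachable again, but it is still "bad".
retry = submit("mygroup", bad_hosts, reachable={"mymachine"})
print(first)
print(retry)
```

In this model the retry never contacts `mymachine` at all, which matches the behaviour described above.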

Reproducible Example

# ~/.ssh/config
# Set up an alias to allow you to "break" comms with the host
Host mymachine
    HostName=<some-valid-host-name>

# global.cylc
[platforms]
    [[myplatform]]
        hosts = mymachine
        install target = localhost

[platform groups]
    [[mygroup]]
        platforms = myplatform

# flow.cylc
[scheduler]
    allow implicit tasks = True
    cycle point format = %Y

[scheduling]
    initial cycle point = 1311
    final cycle point = 1344
    [[graph]]
        R1 = print-config
        P1Y = foo[-P1Y] => foo

[runtime]
    [[print-config]]
        script = cylc config
    [[foo]]
        script = sleep 10
        platform = mygroup
        submission retry delays = PT5S, PT5S, PT5S, PT5S, PT5S, PT5S

CYLC_CONF_PATH=/the/global/config/above cylc vip

  1. Watch the workflow logs until the first attempt at foo has succeeded, then remove the mymachine alias lines from the SSH config.
  2. Watch the workflow logs for submit-failure, then restore the SSH config.
  3. Watch as the submit retries fail to actually retry, because the host is still marked as bad.
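
If you would rather script the "break" and "restore" steps than edit the file by hand, something like the sketch below could work. It is an assumption-laden helper, not part of Cylc: `break_comms` and `restore_comms` are hypothetical names, and the demo runs against a temporary file rather than your real `~/.ssh/config`.

```python
import tempfile
from pathlib import Path

# The alias block from the example above; the placeholder host name is
# left exactly as in the issue and must be filled in by the reader.
ALIAS_BLOCK = "Host mymachine\n    HostName=<some-valid-host-name>\n"


def break_comms(config: Path) -> None:
    """Remove the alias block so that `mymachine` no longer resolves."""
    config.write_text(config.read_text().replace(ALIAS_BLOCK, ""))


def restore_comms(config: Path) -> None:
    """Append the alias block again if it is missing."""
    text = config.read_text()
    if ALIAS_BLOCK in text:
        return
    if text and not text.endswith("\n"):
        text += "\n"
    config.write_text(text + ALIAS_BLOCK)


# Demo on a throwaway file so nothing real is modified.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config"
    cfg.write_text(ALIAS_BLOCK)
    break_comms(cfg)
    broken = cfg.read_text()
    restore_comms(cfg)
    restored = cfg.read_text()
```

Point the two functions at a copy of your SSH config first; appending the block on restore may change its position relative to other `Host` entries.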

Expected Behaviour

If all the hosts of all the platforms in a group are bad, all the hosts of all the platforms should be removed from the bad-hosts set to allow resubmission.
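
Roughly, the expected behaviour could be sketched as below. This is an illustration under stated assumptions, not the fix that went into Cylc: the function name `reset_bad_hosts_if_group_exhausted` and the mapping of platform name to host list are hypothetical.

```python
def reset_bad_hosts_if_group_exhausted(group_platforms, bad_hosts):
    """Clear a group's hosts from bad_hosts when every one of them is bad.

    group_platforms: hypothetical mapping of platform name -> list of hosts.
    bad_hosts: set of hosts currently considered unreachable (mutated).
    Returns True if a reset happened, False otherwise.
    """
    group_hosts = {h for hosts in group_platforms.values() for h in hosts}
    if group_hosts and group_hosts <= bad_hosts:
        # Every host of every platform in the group is bad: forget them all
        # so that the next submission retry can try them again.
        bad_hosts.difference_update(group_hosts)
        return True
    return False


group = {"myplatform": ["mymachine"]}
bad = {"mymachine", "unrelated-host"}
did_reset = reset_bad_hosts_if_group_exhausted(group, bad)
print(did_reset, bad)
```

Hosts outside the group stay in the set, and a group with at least one host still considered good is left alone.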

Note on logging

Logging, especially error logging, is not always clear when selection of hosts from platform groups has failed:

@wxtim wxtim added the bug Something is wrong :( label May 14, 2024
@wxtim wxtim self-assigned this May 14, 2024
@wxtim wxtim added this to the 8.2.x milestone May 14, 2024
@ColemanTom
Contributor

Thanks for finding a nice simple reproducer. I would not have thought to use the .ssh/config file. Hopefully this isn't a hard fix.

@wxtim
Member Author

wxtim commented May 15, 2024

> Thanks for finding a nice simple reproducer. I would not have thought to use the .ssh/config file.

There aren't very many things deliberately breaking your SSH config is good for, but it's a very easy way to simulate broken connections.
