When all hosts in a platform are uncontactable we remove all the hosts on that platform from the list of bad hosts to allow submission retries.
This does not appear to happen when all the platforms in a group are exhausted, so submission retries are ineffective at handling short-term network blips.
Reproducible Example
# ~/.ssh/config
# Set up an alias to allow you to "break" comms with the host
Host mymachine
HostName=<some-valid-host-name>
Watch the workflow logs until the first attempt at foo has succeeded, then remove the lines in the ssh config.
Watch the workflow logs for submit-failure, then restore the ssh config.
Watch as the submit retries fail to actually retry, because the host is still marked as bad.
Expected Behaviour
If all the hosts of all the platforms in a group are bad, all the hosts of all the platforms should be removed from the bad-hosts set to allow resubmission.
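The expected behaviour could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the function `select_host`, its arguments, and the platform and host names are all hypothetical, and the sketch assumes the bad-hosts record is a plain Python set that is mutated in place.

```python
from typing import Dict, List, Optional, Set


def select_host(
    group: List[str],
    platforms: Dict[str, List[str]],
    bad_hosts: Set[str],
) -> Optional[str]:
    """Return the first usable host in the group, or None if there is none.

    Hypothetical helper: `group` lists platform names, `platforms` maps each
    platform name to its hosts, and `bad_hosts` is mutated in place.
    """
    for platform in group:
        for host in platforms[platform]:
            if host not in bad_hosts:
                return host
    # Every host of every platform in the group is marked bad: forget them
    # all so that the next submission retry can try them again.
    for platform in group:
        bad_hosts.difference_update(platforms[platform])
    return None


# Illustrative data only: two platforms in one group, all hosts marked bad.
platforms = {"p1": ["a", "b"], "p2": ["c"]}
bad_hosts = {"a", "b", "c", "unrelated"}
first = select_host(["p1", "p2"], platforms, bad_hosts)
retry = select_host(["p1", "p2"], platforms, bad_hosts)
```

In this sketch the first call finds nothing and clears the group's hosts from the set, so the second call (standing in for a submission retry) can select a host again. Hosts that do not belong to the group stay marked as bad.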
Note on logging
Logging, especially error logging, is not always clear where selection of hosts from platform groups has failed: