cluster validation - allow flapping of validation errors #11049
Merged
+2
−1
Previously, with --wait, if a cluster validated successfully and a subsequent validation then failed (perhaps because a new critical pod had been scheduled and was not yet ready), the validate cluster command would fail immediately. A failed validation now resets the success counter that counts toward --count, allowing validation attempts to continue until the --wait timeout is reached.
I'm hoping this fixes prow job failures like this where
kops validate cluster --count 10 --wait 15m
was invoked at 23:15:48 but exited with failure at 23:22:59 (a duration of only 7m11s). It passed validation 4 consecutive times but failed when a kube-proxy pod became pending. In that case we should just reset the counter and continue validation attempts until we time out.
In my opinion,
kops validate cluster --count 10 --wait 15m
should only ever exit with failure if the 15-minute timeout has been reached.