test validateCluster twice to make sure it does not flap #8088
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zetaab

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
I think this is just papering over the problem. If a node is briefly validating before starting to throw failures again, this will cause rolling update to start draining the next node before the previous one is truly ready. I think that after a cluster validates after terminating a node, we might want to wait a few seconds and then validate again to make sure it isn't flapping.
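A minimal sketch of the suggested approach, assuming a hypothetical `ClusterValidator` interface and a configurable settle delay (the names here are illustrative, not the actual kops API):

```go
package rollingupdate

import "time"

// ClusterValidator is a stand-in for whatever interface the real
// validation code uses; the name and shape are assumed for this sketch.
type ClusterValidator interface {
	Validate() error
}

// validateTwice treats the cluster as healthy only if validation
// passes, a short settle period elapses, and validation passes again,
// so a node that briefly goes Ready before flapping back to failing
// does not let the rolling update move on to the next node.
func validateTwice(v ClusterValidator, settle time.Duration) error {
	if err := v.Validate(); err != nil {
		return err
	}
	time.Sleep(settle) // give a flapping pod time to go unready again
	return v.Validate()
}
```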
There are no unit tests for the new functionality. There should be.
What happens when this happens:
I do not see how this problem can be avoided without retrying. How could we know beforehand what kind of pods will be "starting soon"? Master service pods do not exist in the cluster at all while we are doing rolling updates to a master. I just updated 15 clusters using 1.17.0-alpha.1. The problem occurred on almost half of them, so the rolling-update command had to be executed many times before all masters were rotated.
@johngmyers what do you think about the solution as I have now modified it? It is much simpler, but is it good enough?
I think you could meld together the logic in failAfterOneNodeClusterValidator and failThreeTimesClusterValidator to write a unit test for this.
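Building on the `validateTwice` sketch above, a test fake along these lines could exercise the flapping case. The real failAfterOneNodeClusterValidator and failThreeTimesClusterValidator fakes are shaped differently; this only illustrates the idea of melding them:

```go
package rollingupdate

import (
	"errors"
	"testing"
	"time"
)

// flappingClusterValidator passes the first validation and then fails
// the next failCalls validations, simulating a cluster that briefly
// validates before flapping back to unhealthy.
type flappingClusterValidator struct {
	calls     int
	failCalls int
}

func (v *flappingClusterValidator) Validate() error {
	v.calls++
	if v.calls > 1 && v.calls <= 1+v.failCalls {
		return errors.New("cluster flapped back to unhealthy")
	}
	return nil
}

func TestValidateTwiceCatchesFlap(t *testing.T) {
	v := &flappingClusterValidator{failCalls: 1}
	if err := validateTwice(v, time.Millisecond); err == nil {
		t.Fatal("expected the second validation to catch the flap")
	}
}
```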
For me, the Cilium pod gets created before the node goes into ready state and takes so long to become ready that all the other kube-system pods get created first. So I don't have a good feel for what's going on in your masters. There is the possibility that a kube-system pod could just flap unready due to ambient chaos, coinciding with the initial …
I tried to understand how these tests work, without success. I cannot write tests for this unless someone explains the logic in these tests. How could I reproduce the flapping behaviour in tests? None of the validate function calls in the tests is called multiple times even if I increase the validation timeout.
The tests use a test-specific fake …
It does not work like that. Validate() is only called once, no matter how many times I call validate in tryValidateCluster.
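One way to check from a test how often validation is actually invoked is a fake that counts calls; the name and wiring below are hypothetical, not the real kops test fixture:

```go
package rollingupdate

// countingClusterValidator records how many times Validate is invoked.
// Plugging a fake like this into the test harness would show whether
// tryValidateCluster really re-validates.
type countingClusterValidator struct {
	calls int
}

func (v *countingClusterValidator) Validate() error {
	v.calls++
	return nil // always report healthy; we only care about the count
}
```

Asserting that `calls` equals 2 after a successful double validation would confirm the retry actually happened.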
/test pull-kops-e2e-kubernetes-aws
Seems like a sane change to me for the added validation.
/lgtm
/test pull-kops-e2e-kubernetes-aws
fixes #8078
/kind bug