Rationalize timeouts for rolling-update #3658
Conversation
The intervals remain the minimum time between instances; drain & validate time is additional.
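To make those timing semantics concrete, here is a minimal Go sketch under the assumptions stated in the description; the helpers `drain`, `terminate`, and `validateCluster` are hypothetical stand-ins, not the actual kops functions. The point it illustrates: the configured interval is always waited out between instance replacements, and drain/validate time is spent in addition to it.

```go
package rollingsketch

import (
	"log"
	"time"
)

// instance is an illustrative stand-in for a cloud instance in an instance group.
type instance struct{ id string }

// rollInstanceGroup sketches the behaviour described in this PR: the interval
// is the minimum time between instance replacements, while drain and validate
// time is additional to it (not subtracted from it). The drain, terminate, and
// validateCluster callbacks are hypothetical and supplied by the caller.
func rollInstanceGroup(
	instances []instance,
	interval time.Duration,
	drain, terminate func(instance) error,
	validateCluster func() error,
) error {
	for _, inst := range instances {
		// Draining takes however long it takes; it does not eat into the interval.
		if err := drain(inst); err != nil {
			return err
		}
		if err := terminate(inst); err != nil {
			return err
		}
		log.Printf("replaced %s; waiting the configured interval", inst.id)

		// Always wait at least the configured interval before proceeding.
		time.Sleep(interval)

		// Validation also runs on top of the interval, so the total gap between
		// instances is interval + drain time + validation time.
		if err := validateCluster(); err != nil {
			return err
		}
	}
	return nil
}
```

With a 5-minute master interval, for example, the gap between two master replacements is at least 5 minutes, and longer if draining or validation is slow.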
So I understand that my PR changed functionality. I am OK with breaking out the validation time value, but we need a validation time based on IG type. Masters take about the same time as each other, but some instance group types could take forever to drain. How do we keep the original functionality without letting users do something really silly? To me, setting a 10ms interval should behave like a cloud-only update. Should we switch to that automatically below a certain threshold? What ideas do you have? I almost think we should look at putting this information into the IG API. That would make cluster self-upgrades more possible, and make life feasible for nuts like me who need super-complex upgrades.
If & when people want to change the validation timeout, this should be a flag I think. But that's a separate PR. I think because we wait for the master-interval / node-interval timeout in between, we effectively have a per-IG validation time. Users should be allowed to set low timeouts, and we shouldn't magically switch modes on them - violates the principle of least surprise. I'm also fine with changing the default validation timeout to be higher - I don't consider that a breaking change, and I think a release note would be sufficient. |
If we tell the user that we are switching modes on them, wouldn't that be OK? If the user sets the timeout under a minute, or 30 seconds, we warn them and turn off validation. Heck, why even drain? That will monkey with things as well. Rolling a cluster with 20-second timeouts might as well be 1-second timeouts: it is not best practice for typical production use cases, and it will cause downtime. By default, kops rolling update should do a rolling update that is recommended for production. If you are setting a timeout that low, why not set cloud-only? What is your concern with changing this behavior?
Addressing the comment that a user should be able to set a low timeout: they can, but when they do, the cluster will not validate. They need to be told "hey, your cluster did not validate, we are exiting." Then, if they want to force the issue, they can set another flag and go on their merry way. The other option is that the user actually reads the docs and sets the flags accordingly. We turned on validation so that users can roll their clusters without running into situations like kube-dns falling over. But they can still do 20ms waits; they just need to set cloud-only. Thoughts?
So the previous behaviour was that if the user set a master-interval, we would wait that time between master restarts. We no longer do that, and this PR fixes it. I don't understand the counter-proposal, @chrislovecnm. Can you send a PR?
Ping @chrislovecnm ... holding up the release choo-choo train here!
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: chrislovecnm
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing /approve in a comment.
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue.