
Rationalize timeouts for rolling-update #3658

Conversation

justinsb
Member

The intervals remain the minimum time between instances; drain &
validate time is additional.

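The timing rule in the description can be sketched as a simplified model (hypothetical code, not the actual kops implementation; the function name and example numbers are illustrative):

```python
# Hypothetical model of the rolling-update timing rule: the interval is
# the minimum wait between instances, while drain and validate time is
# added on top of it for every instance.

def rolling_update_duration(n_instances, interval_s, drain_s, validate_s):
    """Lower bound on wall-clock time for a rolling update (sketch)."""
    if n_instances <= 0:
        return 0
    per_instance = drain_s + validate_s    # additional work per instance
    gaps = (n_instances - 1) * interval_s  # minimum wait between instances
    return n_instances * per_instance + gaps

# e.g. 3 nodes, 5m node-interval, 1m drain, 2m validate:
# 3 * (60 + 120) + 2 * 300 = 1140 seconds
```

Under this model, shrinking the interval only removes the gaps; per-instance drain and validation time remains, which is the "additional" part of the rule above.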
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 17, 2017
@chrislovecnm
Contributor

So I understand that my PR changed functionality.

I am ok with breaking out the validation time value, but we need a validation time based on IG types. Masters all take about the same time, but some instance group types could take forever to drain.

How do we keep the original functionality, but not allow users to do something really silly? To me, setting a 10ms timeout should just do a cloud-only update. Should we set that automatically below a certain threshold?

What ideas do you have?

I almost think we should look at putting this information into the IG API. This would make cluster self-upgrades more possible, and make super-complex upgrades feasible for nuts like me who need them.

@justinsb
Member Author

If & when people want to change the validation timeout, this should be a flag I think. But that's a separate PR. I think because we wait for the master-interval / node-interval timeout in between, we effectively have a per-IG validation time.

Users should be allowed to set low timeouts, and we shouldn't magically switch modes on them - violates the principle of least surprise.

I'm also fine with changing the default validation timeout to be higher - I don't consider that a breaking change, and I think a release note would be sufficient.

@chrislovecnm
Contributor

If we tell the user that we are switching modes on them, wouldn't that be ok? If the user sets the timeout under 1 minute or 30 seconds, then we warn them and turn off validation. Heck, why do the drain? It will monkey with things as well.

Rolling a cluster with 20-second timeouts might as well use 1-second timeouts. It is not best practice for typical production use cases, and it will cause downtime.

By default, kops rolling update should do a rolling update that is recommended for production. If you are setting a timeout that low, why not set cloud-only?

What is your concern of changing this behavior?
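The warn-and-fall-back idea proposed above could be sketched roughly like this (a hypothetical sketch of the proposal only, not actual kops behavior; the threshold value and names are made up for illustration):

```python
# Sketch of the proposed behaviour: below some threshold, warn the user
# and fall back to a cloud-only update (no drain, no validation).

THRESHOLD_S = 30  # illustrative; the thread suggests 30s-1min

def plan_rolling_update(interval_s, threshold_s=THRESHOLD_S):
    """Return which safety steps to run for a given interval (sketch)."""
    if interval_s < threshold_s:
        print(f"warning: interval {interval_s}s is below {threshold_s}s; "
              "skipping drain and validation (cloud-only)")
        return {"drain": False, "validate": False}
    return {"drain": True, "validate": True}
```

This is the automatic mode switch that justinsb argues against below, on least-surprise grounds; the merged PR keeps the user's settings as given.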

@chrislovecnm
Contributor

Addressing the comment that a user should be able to set a low timeout: they can, but when they do, the cluster will not validate. They need to be told "hey, your cluster did not validate, we are exiting." Then, if they want to force the issue, the user can set another flag and go on their merry way.

The other option is that the user actually reads the docs, and sets the flags accordingly.

We turned on validation so that users can roll their clusters without running into situations like kube-dns falling over. But they can do 20ms waits; they just need to set cloud-only.

Thoughts?

@justinsb
Member Author

justinsb commented Nov 4, 2017

So the previous behaviour was that if the user set a master-interval, we would wait that time between master restarts. We no longer do that, and this PR fixes it.

I don't understand the counter-proposal @chrislovecnm . Can you send a PR?

@justinsb
Member Author

justinsb commented Nov 5, 2017

Ping @chrislovecnm ... holding up the release choo-choo train here!

@chrislovecnm
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 5, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrislovecnm

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 5, 2017
@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue.

@k8s-github-robot k8s-github-robot merged commit 9c30c30 into kubernetes:master Nov 5, 2017