Rationalize timeouts for rolling-update #3658
Conversation
The intervals remain the minimum time between instances; drain & validate time is additional.
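To make those timing semantics concrete, here is a minimal Go sketch under the assumptions stated in the description; the helpers `drain`, `terminate`, and `validateCluster` are hypothetical stand-ins, not the actual kops functions. The point it illustrates: the configured interval is always waited out between instance replacements, and drain/validate time is spent in addition to it.

```go
package rollingsketch

import (
	"log"
	"time"
)

// instance is an illustrative stand-in for a cloud instance in an instance group.
type instance struct{ id string }

// rollInstanceGroup sketches the behaviour described in this PR: the interval
// is the minimum time between instance replacements, while drain and validate
// time is additional to it (not subtracted from it). The drain, terminate, and
// validateCluster callbacks are hypothetical and supplied by the caller.
func rollInstanceGroup(
	instances []instance,
	interval time.Duration,
	drain, terminate func(instance) error,
	validateCluster func() error,
) error {
	for _, inst := range instances {
		// Draining takes however long it takes; it does not eat into the interval.
		if err := drain(inst); err != nil {
			return err
		}
		if err := terminate(inst); err != nil {
			return err
		}
		log.Printf("replaced %s; waiting the configured interval", inst.id)

		// Always wait at least the configured interval before proceeding.
		time.Sleep(interval)

		// Validation also runs on top of the interval, so the total gap between
		// instances is interval + drain time + validation time.
		if err := validateCluster(); err != nil {
			return err
		}
	}
	return nil
}
```

With a 5-minute master interval, for example, the gap between two master replacements is at least 5 minutes, and longer if draining or validation is slow.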
So I understand that my PR changed functionality. I am OK with breaking out the validation time value, but we need a validation time based on IG type. Masters take about the same time as each other, but some instance group types could take forever to drain. How do we keep the original functionality without letting users do something really silly? To me, setting a 10ms interval should behave like a cloud-only update. Should we switch to that automatically below a certain threshold? What ideas do you have? I almost think we should look at putting this information into the IG API. That would make cluster self-upgrades more possible, and make life feasible for nuts like me who need super-complex upgrades.
If & when people want to change the validation timeout, this should be a flag I think. But that's a separate PR. I think because we wait for the master-interval / node-interval timeout in between, we effectively have a per-IG validation time. Users should be allowed to set low timeouts, and we shouldn't magically switch modes on them - violates the principle of least surprise. I'm also fine with changing the default validation timeout to be higher - I don't consider that a breaking change, and I think a release note would be sufficient. |
If we tell the user that we are switching modes on them, wouldn't that be OK? If the user sets the timeout under a minute, or 30 seconds, we warn them and turn off validation. Heck, why even drain? That will monkey with things as well. Rolling a cluster with 20-second timeouts might as well be 1-second timeouts: it is not best practice for typical production use cases, and it will cause downtime. By default, kops rolling update should do a rolling update that is recommended for production. If you are setting a timeout that low, why not set cloud-only? What is your concern with changing this behavior?
Addressing the comment that a user should be able to set a low timeout: they can, but when they do, the cluster will not validate. They need to be told "hey, your cluster did not validate, we are exiting." Then, if they want to force the issue, they can set another flag and go on their merry way. The other option is that the user actually reads the docs and sets the flags accordingly. We turned on validation so that users can roll their clusters without running into situations like kube-dns falling over. But they can still do 20ms waits; they just need to set cloud-only. Thoughts?
So the previous behaviour was that if the user set a master-interval, we would wait that time between master restarts. We no longer do that, and this PR fixes it. I don't understand the counter-proposal, @chrislovecnm. Can you send a PR?
Ping @chrislovecnm ... holding up the release choo-choo train here!
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: chrislovecnm
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing /approve in a comment.
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue.