promoting drain and validate by setting feature flag to true #3329

Merged

chrislovecnm
Contributor

@chrislovecnm chrislovecnm commented Sep 2, 2017

I am unable to recreate #2407, and frankly, it may be an edge case. We could warn a user if their wait times are low, but that would be another PR.

This PR moves Drain and Validate functionality for rolling-updates into the default user experience, setting the Feature Flag to true.

Per feedback, I am using the node and master interval times for the validation.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 2, 2017
@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 2, 2017
@gambol99
Contributor

gambol99 commented Sep 2, 2017

lgtm ...

@justinsb
Member

justinsb commented Sep 2, 2017

Drain and validate are still broken for me with low intervals.

I'll verify and post output

@chrislovecnm
Contributor Author

@justinsb should we force reasonable intervals or tell the user to do cloud only? Drain and validate is not made for low intervals.

Doing an update in production is not something that is super fast, in my opinion.

@chrislovecnm
Contributor Author

chrislovecnm commented Sep 5, 2017

@justinsb I have completed some testing tonight.

  1. Testing with 2m interval, which is a bit low. It typically takes 3-5 minutes for a node or master to start. Succeeds as expected: https://gist.github.com/chrislovecnm/9cf1d2644851aa04ed40dc23fce92c31
  2. Testing with a 2s interval, which is nuts. This fails as it should. We validate the cluster 8 extra times, and the rolling-update stops. This is expected, as documented; see --validate-retries. Here are the results: https://gist.github.com/chrislovecnm/f5dd79d20724d54f787673563cc1ff75

We should warn if a user is setting the interval under 3 minutes, but I am unable to recreate the problems you are encountering.

What application do you have installed? Am I noticing Prometheus?

Can we set up a time to work through this? I have done a lot of testing with this code, and I know other people are using it in production. How do we proceed?

@k8s-github-robot k8s-github-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 13, 2017
@justinsb
Member

justinsb commented Sep 14, 2017

So I run with a 2 second interval because I don't want to miss any edge cases. Compared to a 2m interval, it's like running 60x more tests!

We discussed this on slack:

  • The --node-interval and --master-interval probably should keep their existing meaning, which is the minimum interval to wait between cycling nodes. So if you want to do a slow cycle, you can set them to 1h and run it over the weekend.
  • I don't see a reason to expose the validation / poll interval. I think it's fine to keep this at 2 minutes if you want, though I think something more like 30 seconds is probably going to be faster. I agree it doesn't need to be 2 seconds - that is just because it uncovers challenges.
  • I think you want a --validation-timeout setting - i.e. a total time. This lets us adopt more advanced timing strategies in future, like watching for the instance to be started before starting a poll.

What do you think?

@justinsb
Member

Another option is to treat --master-interval and --node-interval as the maximum interval between cycling nodes, i.e. the validation timeout. This is nice in terms of compatibility, because you don't need any more flags, and we just proceed once we know the cluster is stable. But on the other hand, setting the minimum interval gives us a pod-disruption-budget style handling, where we say we don't want to cycle the cluster as fast as we can, because the applications need a little longer to be totally healthy (e.g. just because Cassandra is running doesn't mean it isn't doing a repair).

@blakebarnett

I forgot to create an issue with details, but randomly I have seen rolling-update choose to do the nodes first. I'm completely baffled by how this can happen based on the code, but it has definitely happened at least twice for me, once on a production cluster. If I hadn't been paying close attention it probably would have caused a major outage.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 24, 2017
export KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate"
# do not fail if the cluster does not validate
# wait 8 min to create new node, and at least 8 min
# to validate the cluster.
Member

What's going on with these indents? Not a merge blocker, but is this deliberate?

for i := 0; i <= rollingUpdateData.ValidateRetries; i++ {
// ValidateClusterWithDuration runs validation.ValidateCluster until either we get positive result or the timeout expires
func (n *CloudInstanceGroup) ValidateClusterWithDuration(rollingUpdateData *RollingUpdateCluster, instanceGroupList *api.InstanceGroupList, duration time.Duration) error {
// TODO should we expose this to the UI?
Member

Probably once a use-case is demonstrated for doing so, but until then, no :-)

select {
case <-timeout:
// Got a timeout fail with a timeout error
return fmt.Errorf("cluster did not validate within a duation of %q", duration)
Member

typo: duation

If rolling-update does not report that the cluster needs to be rolled you can force the cluster to be
rolled with the force flag. Rolling update drains and validates the cluster by default. A cluster is
deemed validated when all required nodes are running, and all pods in the kube-system namespace are operational.
When a node is deleted rolling-update sleeps the interval for the node type, and the tries for the same period
Member

typo: s/the/then

@justinsb
Member

LGTM 🎉

return fmt.Errorf("cluster validation failed: %v", err)
func (n *CloudInstanceGroup) tryValidateCluster(rollingUpdateData *RollingUpdateCluster, instanceGroupList *api.InstanceGroupList, duration time.Duration, tickDuration time.Duration) bool {
if _, err := validation.ValidateCluster(rollingUpdateData.ClusterName, instanceGroupList, rollingUpdateData.K8sClient); err != nil {
glog.Infof("Cluster did not validate, will try again in %q util duration %q expires: %v.", tickDuration, duration, err)
Member

typo: util

@justinsb
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 24, 2017
@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 24, 2017
@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@chrislovecnm
Contributor Author

I will file an issue for the typos, and the indenting is removed when the markdown is generated.

@k8s-github-robot

/lgtm cancel //PR changed after LGTM, removing LGTM. @chrislovecnm @justinsb

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 24, 2017
@justinsb
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 24, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: justinsb

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue.

@k8s-github-robot k8s-github-robot merged commit ba42020 into kubernetes:master Sep 24, 2017
@chrislovecnm chrislovecnm deleted the promote-drain-validate branch September 24, 2017 04:42
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
6 participants