
Setting timeouts to 2s can cause issue with rolling update #2407

Closed · chrislovecnm opened this issue Apr 22, 2017 · 9 comments

@chrislovecnm (Contributor):

I need to test this. This issue is terse because it is a note for me ;)

Once I get this closed I will flip the feature flag.

@justinsb (Member) commented Sep 3, 2017:

Got this one with --master-interval=2s --node-interval=2s (I didn't change --drain-interval, but I realize I should have changed that one also).

Failed to drain node "ip-172-20-35-135.ec2.internal": error draining node: Get https://api.simple.example.com/apis/apps/v1beta1/namespaces/monitoring/statefulsets/prometheus-k8s: dial tcp 34.233.120.84:443: i/o timeout: prometheus-k8s-1, prometheus-k8s-1; Get https://api.simple.example.com/apis/extensions/v1beta1/namespaces/kube-system/replicasets/kubernetes-dashboard-2465510325: unexpected EOF: kubernetes-dashboard-2465510325-ns5lx; Get https://api.simple.example.com/apis/extensions/v1beta1/namespaces/kube-system/replicasets/kubernetes-dashboard-2465510325: dial tcp 34.233.120.84:443: getsockopt: connection refused: kubernetes-dashboard-2465510325-ns5lx; Get https://api.simple.example.com/apis/extensions/v1beta1/namespaces/kube-tls-acme/replicasets/kube-lego-822705050: dial tcp 34.233.120.84:443: i/o timeout: kube-lego-822705050-g3rck, kube-lego-822705050-g3rck; Get https://api.simple.example.com/apis/apps/v1beta1/namespaces/monitoring/statefulsets/alertmanager-main: dial tcp 34.233.120.84:443: i/o timeout: alertmanager-main-1, alertmanager-main-1

The reason I think we have to address this is that when the interval is long we "get lucky" and avoid these transient conditions ... usually. But across our larger user base, people won't always be so lucky.
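
For illustration, here is a minimal Go sketch of riding out transient apiserver errors like the i/o timeouts above by retrying the failing call instead of aborting the rolling update. The helper name, attempt count, and delay are made up for this sketch; it is not kops code.

    package main

    import (
    	"errors"
    	"fmt"
    	"time"
    )

    // retryOnError retries op a fixed number of times with a fixed delay.
    // Hypothetical helper for illustration only; the real drain logic in
    // kops is not shown here.
    func retryOnError(attempts int, delay time.Duration, op func() error) error {
    	var err error
    	for i := 0; i < attempts; i++ {
    		if err = op(); err == nil {
    			return nil
    		}
    		time.Sleep(delay)
    	}
    	return fmt.Errorf("giving up after %d attempts: %v", attempts, err)
    }

    func main() {
    	// Simulate the transient "i/o timeout" seen while the apiserver restarts.
    	err := retryOnError(3, time.Second, func() error {
    		return errors.New("dial tcp 34.233.120.84:443: i/o timeout")
    	})
    	fmt.Println(err)
    }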

@justinsb (Member) commented Sep 3, 2017:

And then:

I0903 09:20:07.806821   20463 node_api_adapter.go:197] Couldn't find condition NetworkUnavailable on node ip-172-20-32-30.ec2.internal
I0903 09:20:07.806851   20463 node_api_adapter.go:197] Couldn't find condition NetworkUnavailable on node ip-172-20-39-7.ec2.internal
I0903 09:20:07.806861   20463 node_api_adapter.go:197] Couldn't find condition NetworkUnavailable on node ip-172-20-80-186.ec2.internal

Although that might have been introduced in 1.7 or 1.7.5 ...

@chrislovecnm (Contributor, Author):

What cluster spec are you using? Enough people are using this that we must just be getting lucky; we have gotten few issues specifically about this feature.

Regardless, we need to warn, or force, the user to use reasonable intervals. Any ideas what is happening?

I am guessing that drain and validate is moving on too fast, before the master is actually back up.
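
A rough Go sketch of that guess: poll the apiserver health endpoint until it answers before draining or validating. Nothing here is kops code; the URL, timeout, and intervals are placeholders.

    package main

    import (
    	"fmt"
    	"net/http"
    	"time"
    )

    // waitForAPIServer polls a health endpoint until it returns 200 OK or the
    // timeout expires. Illustrative only.
    func waitForAPIServer(url string, timeout time.Duration) error {
    	client := &http.Client{Timeout: 5 * time.Second}
    	deadline := time.Now().Add(timeout)
    	for time.Now().Before(deadline) {
    		resp, err := client.Get(url)
    		if err == nil {
    			resp.Body.Close()
    			if resp.StatusCode == http.StatusOK {
    				return nil
    			}
    		}
    		time.Sleep(2 * time.Second)
    	}
    	return fmt.Errorf("apiserver at %s not healthy within %s", url, timeout)
    }

    func main() {
    	if err := waitForAPIServer("https://api.simple.example.com/healthz", 5*time.Minute); err != nil {
    		fmt.Println(err)
    	}
    }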

@chrislovecnm (Contributor, Author):

@justinsb a cluster spec would be helpful.

I do have a hypothesis.

  1. This is only happening with a single master; in an HA cluster we have two other masters that are still up.
  2. We should be checking the health-check endpoint first.
  3. This is occurring because the API server and other components are flapping. The term I am referring to is the one used by monitoring systems like Nagios: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/flapping.html

@blakebarnett @gambol99 @KashifSaadat @alrs @sethpollack

Can anyone recommend a good pattern for detecting flapping services?

I think we should verify that the following services are not flapping, in order to detect that a master node has been upgraded properly.

  1. etcd
  2. API server
  3. Scheduler
  4. dns-controller
  5. kube-controller-manager (k-c-m)
  6. CNI provider

My recommendation is that we determine that all of these components are not flapping before we proceed to the next master, or to the nodes.
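
One way to express that recommendation is to require some number of consecutive healthy checks from every component before moving on; a flapping component keeps resetting its streak, so the rolling update never proceeds past it. A minimal Go sketch, assuming a generic per-component check function (the component names, streak length, and timings are placeholders, not kops code):

    package main

    import (
    	"fmt"
    	"time"
    )

    // waitForStable requires `required` consecutive healthy checks from every
    // component before returning. Any failure resets that component's streak.
    func waitForStable(components []string, required int, interval time.Duration,
    	check func(component string) bool) error {
    	streak := make(map[string]int)
    	deadline := time.Now().Add(15 * time.Minute)
    	for time.Now().Before(deadline) {
    		stable := true
    		for _, c := range components {
    			if check(c) {
    				streak[c]++
    			} else {
    				streak[c] = 0 // failure resets the streak
    			}
    			if streak[c] < required {
    				stable = false
    			}
    		}
    		if stable {
    			return nil
    		}
    		time.Sleep(interval)
    	}
    	return fmt.Errorf("components did not stabilize: %v", streak)
    }

    func main() {
    	components := []string{"etcd", "apiserver", "scheduler", "dns-controller", "kube-controller-manager", "cni"}
    	err := waitForStable(components, 3, time.Second, func(c string) bool {
    		return true // stand-in for a real per-component health check
    	})
    	fmt.Println(err)
    }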

@chrislovecnm (Contributor, Author):

Here is an idea: http://linuxczar.net/blog/2016/01/31/flap-detection/

Warning Math ... ugh math
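
For what it's worth, here is a small Go sketch of that kind of weighted flap score over a sliding window of health-check results; this is my reading of the Nagios-style algorithm, so treat the window size, weights, and thresholds as assumptions to be tuned.

    package main

    import "fmt"

    // flapScore returns a weighted percent-state-change over a window of
    // health-check results (true = healthy). Recent transitions are weighted
    // more heavily (roughly 0.8 up to 1.2), in the spirit of Nagios flap
    // detection. Above a high threshold the component is considered flapping.
    func flapScore(history []bool) float64 {
    	if len(history) < 2 {
    		return 0
    	}
    	var changes, maxChanges float64
    	for i := 1; i < len(history); i++ {
    		weight := 0.8 + 0.4*float64(i)/float64(len(history)-1)
    		maxChanges += weight
    		if history[i] != history[i-1] {
    			changes += weight
    		}
    	}
    	return 100 * changes / maxChanges
    }

    func main() {
    	flappy := []bool{true, false, true, false, true, false, true, false}
    	steady := []bool{true, true, true, true, true, true, true, true}
    	fmt.Printf("flappy: %.0f%%  steady: %.0f%%\n", flapScore(flappy), flapScore(steady))
    }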

@chrislovecnm (Contributor, Author):

@justinsb

Ran a test with a 2m interval; here are the results: https://gist.github.com/chrislovecnm/9cf1d2644851aa04ed40dc23fce92c31

Looks great :)

Let's test the insane 2s interval.

@chrislovecnm (Contributor, Author):

Here are the results with a 2s interval; it fails as expected, due to exceeding the default number of validation attempts: https://gist.github.com/chrislovecnm/f5dd79d20724d54f787673563cc1ff75

We only try to validate so many times for each node. As documented:

      --validate-retries int         The number of times that a node will be validated.  Between validation kops sleeps the master-interval/2 or node-interval/2 duration. (default 8)
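
So with --node-interval=2s and the default of 8 retries, kops would sleep only about 1s between attempts and give up after roughly 8 seconds of failed validation, which matches the expected failure above.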

@chrislovecnm (Contributor, Author):

@justinsb how can I break the code in the manner you did? The 2s interval causes the validation to time out as expected.

k8s-github-robot pushed a commit that referenced this issue Sep 24, 2017
Automatic merge from submit-queue.

promoting drain and validate by setting feature flag to true

I am unable to recreate #2407, and frankly, it may be an edge case.  We could warn a user if their wait times are low, but that would be another PR.

This PR moves Drain and Validate functionality for rolling-updates into the default user experience, setting the Feature Flag to true.

Per feedback, I am using the node and master interval times for the validation.

@chrislovecnm (Contributor, Author):

Closing this ... non-issue.
