
Setting timeouts to 2s can cause issue with rolling update #2407

Closed · chrislovecnm opened this issue Apr 22, 2017 · 9 comments

@chrislovecnm (Contributor):

I need to test this. This issue is terse because it is a note for me ;)

Once I get this closed I will flip the feature flag.

@justinsb (Member) commented Sep 3, 2017:

Got this one with --master-interval=2s --node-interval=2s (I didn't change --drain-interval, but I realize I should have changed that one also).

Failed to drain node "ip-172-20-35-135.ec2.internal": error draining node: Get https://api.simple.example.com/apis/apps/v1beta1/namespaces/monitoring/statefulsets/prometheus-k8s: dial tcp 34.233.120.84:443: i/o timeout: prometheus-k8s-1, prometheus-k8s-1; Get https://api.simple.example.com/apis/extensions/v1beta1/namespaces/kube-system/replicasets/kubernetes-dashboard-2465510325: unexpected EOF: kubernetes-dashboard-2465510325-ns5lx; Get https://api.simple.example.com/apis/extensions/v1beta1/namespaces/kube-system/replicasets/kubernetes-dashboard-2465510325: dial tcp 34.233.120.84:443: getsockopt: connection refused: kubernetes-dashboard-2465510325-ns5lx; Get https://api.simple.example.com/apis/extensions/v1beta1/namespaces/kube-tls-acme/replicasets/kube-lego-822705050: dial tcp 34.233.120.84:443: i/o timeout: kube-lego-822705050-g3rck, kube-lego-822705050-g3rck; Get https://api.simple.example.com/apis/apps/v1beta1/namespaces/monitoring/statefulsets/alertmanager-main: dial tcp 34.233.120.84:443: i/o timeout: alertmanager-main-1, alertmanager-main-1

The reason I think we have to address this is that when the interval is long we "get lucky" and avoid these transient conditions ... usually. But across our larger user base, people won't always be so lucky.
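
For illustration, here is a minimal Go sketch of riding out transient apiserver errors like the i/o timeouts above by retrying the failing call instead of aborting the rolling update. The helper name, attempt count, and delay are made up for this sketch; it is not kops code.

    package main

    import (
    	"errors"
    	"fmt"
    	"time"
    )

    // retryOnError retries op a fixed number of times with a fixed delay.
    // Hypothetical helper for illustration only; the real drain logic in
    // kops is not shown here.
    func retryOnError(attempts int, delay time.Duration, op func() error) error {
    	var err error
    	for i := 0; i < attempts; i++ {
    		if err = op(); err == nil {
    			return nil
    		}
    		time.Sleep(delay)
    	}
    	return fmt.Errorf("giving up after %d attempts: %v", attempts, err)
    }

    func main() {
    	// Simulate the transient "i/o timeout" seen while the apiserver restarts.
    	err := retryOnError(3, time.Second, func() error {
    		return errors.New("dial tcp 34.233.120.84:443: i/o timeout")
    	})
    	fmt.Println(err)
    }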

@justinsb (Member) commented Sep 3, 2017:

And then:

I0903 09:20:07.806821   20463 node_api_adapter.go:197] Couldn't find condition NetworkUnavailable on node ip-172-20-32-30.ec2.internal
I0903 09:20:07.806851   20463 node_api_adapter.go:197] Couldn't find condition NetworkUnavailable on node ip-172-20-39-7.ec2.internal
I0903 09:20:07.806861   20463 node_api_adapter.go:197] Couldn't find condition NetworkUnavailable on node ip-172-20-80-186.ec2.internal

Although that might have been introduced in 1.7 or 1.7.5 ...

@chrislovecnm (Contributor, Author):

What cluster spec are you using? Enough people are using this that we must just be getting lucky; we have gotten few issues specifically about this feature.

Regardless, we need to warn, or force, the user to use reasonable intervals. Any ideas what is happening?

I am guessing that drain and validate is moving on too fast, before the master is actually back up.
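
A rough Go sketch of that guess: poll the apiserver health endpoint until it answers before draining or validating. Nothing here is kops code; the URL, timeout, and intervals are placeholders.

    package main

    import (
    	"fmt"
    	"net/http"
    	"time"
    )

    // waitForAPIServer polls a health endpoint until it returns 200 OK or the
    // timeout expires. Illustrative only.
    func waitForAPIServer(url string, timeout time.Duration) error {
    	client := &http.Client{Timeout: 5 * time.Second}
    	deadline := time.Now().Add(timeout)
    	for time.Now().Before(deadline) {
    		resp, err := client.Get(url)
    		if err == nil {
    			resp.Body.Close()
    			if resp.StatusCode == http.StatusOK {
    				return nil
    			}
    		}
    		time.Sleep(2 * time.Second)
    	}
    	return fmt.Errorf("apiserver at %s not healthy within %s", url, timeout)
    }

    func main() {
    	if err := waitForAPIServer("https://api.simple.example.com/healthz", 5*time.Minute); err != nil {
    		fmt.Println(err)
    	}
    }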

@chrislovecnm (Contributor, Author):

@justinsb a cluster spec would be helpful.

I do have a hypothesis.

  1. This is only happening with a single master; in an HA cluster we have two other masters that are still up.
  2. We should be checking the health-check endpoint first.
  3. This is occurring because the API server and other components are flapping. The term I am referring to is the one used by monitoring systems like Nagios: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/flapping.html

@blakebarnett @gambol99 @KashifSaadat @alrs @sethpollack

Can anyone recommend a good pattern for detecting flapping services?

I think we should verify that the following services are not flapping, in order to detect that a master node has been upgraded properly.

  1. etcd
  2. API server
  3. Scheduler
  4. dns-controller
  5. kube-controller-manager (k-c-m)
  6. CNI provider

My recommendation is that we determine that all of these components are not flapping before we proceed to the next master, or to the nodes.
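
One way to express that recommendation is to require some number of consecutive healthy checks from every component before moving on; a flapping component keeps resetting its streak, so the rolling update never proceeds past it. A minimal Go sketch, assuming a generic per-component check function (the component names, streak length, and timings are placeholders, not kops code):

    package main

    import (
    	"fmt"
    	"time"
    )

    // waitForStable requires `required` consecutive healthy checks from every
    // component before returning. Any failure resets that component's streak.
    func waitForStable(components []string, required int, interval time.Duration,
    	check func(component string) bool) error {
    	streak := make(map[string]int)
    	deadline := time.Now().Add(15 * time.Minute)
    	for time.Now().Before(deadline) {
    		stable := true
    		for _, c := range components {
    			if check(c) {
    				streak[c]++
    			} else {
    				streak[c] = 0 // failure resets the streak
    			}
    			if streak[c] < required {
    				stable = false
    			}
    		}
    		if stable {
    			return nil
    		}
    		time.Sleep(interval)
    	}
    	return fmt.Errorf("components did not stabilize: %v", streak)
    }

    func main() {
    	components := []string{"etcd", "apiserver", "scheduler", "dns-controller", "kube-controller-manager", "cni"}
    	err := waitForStable(components, 3, time.Second, func(c string) bool {
    		return true // stand-in for a real per-component health check
    	})
    	fmt.Println(err)
    }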

@chrislovecnm (Contributor, Author):

Here is an idea: http://linuxczar.net/blog/2016/01/31/flap-detection/

Warning Math ... ugh math
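
For what it's worth, here is a small Go sketch of that kind of weighted flap score over a sliding window of health-check results; this is my reading of the Nagios-style algorithm, so treat the window size, weights, and thresholds as assumptions to be tuned.

    package main

    import "fmt"

    // flapScore returns a weighted percent-state-change over a window of
    // health-check results (true = healthy). Recent transitions are weighted
    // more heavily (roughly 0.8 up to 1.2), in the spirit of Nagios flap
    // detection. Above a high threshold the component is considered flapping.
    func flapScore(history []bool) float64 {
    	if len(history) < 2 {
    		return 0
    	}
    	var changes, maxChanges float64
    	for i := 1; i < len(history); i++ {
    		weight := 0.8 + 0.4*float64(i)/float64(len(history)-1)
    		maxChanges += weight
    		if history[i] != history[i-1] {
    			changes += weight
    		}
    	}
    	return 100 * changes / maxChanges
    }

    func main() {
    	flappy := []bool{true, false, true, false, true, false, true, false}
    	steady := []bool{true, true, true, true, true, true, true, true}
    	fmt.Printf("flappy: %.0f%%  steady: %.0f%%\n", flapScore(flappy), flapScore(steady))
    }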

@chrislovecnm (Contributor, Author):

@justinsb

Ran a test with a 2m interval; here are the results: https://gist.github.com/chrislovecnm/9cf1d2644851aa04ed40dc23fce92c31

Looks great :)

Let's test the insane 2s interval.

@chrislovecnm (Contributor, Author):

Here are the results with a 2s interval; it fails as expected, due to exceeding the default number of validation attempts: https://gist.github.com/chrislovecnm/f5dd79d20724d54f787673563cc1ff75

We only try to validate so many times for each node. As documented:

      --validate-retries int         The number of times that a node will be validated.  Between validation kops sleeps the master-interval/2 or node-interval/2 duration. (default 8)
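
So with --node-interval=2s and the default of 8 retries, kops would sleep only about 1s between attempts and give up after roughly 8 seconds of failed validation, which matches the expected failure above.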

@chrislovecnm (Contributor, Author):

@justinsb how can I break the code in the manner you did? The 2s interval causes the validation to time out as expected.

k8s-github-robot pushed a commit that referenced this issue Sep 24, 2017
Automatic merge from submit-queue.

promoting drain and validate by setting feature flag to true

I am unable to recreate #2407, and frankly, it may be an edge case.  We could warn a user if their wait times are low, but that would be another PR.

This PR moves Drain and Validate functionality for rolling-updates into the default user experience, setting the Feature Flag to true.

Per feedback, I am using the node and master interval times for the validation.

@chrislovecnm (Contributor, Author):

Closing this ... non-issue.
