Setting timeouts to 2s can cause issue with rolling update #2407
Got this one with
I think we have to address this because when the interval is long we "get lucky" and avoid these transient conditions ... usually. But across our larger user base, we won't be so lucky.
And then:
Although that might have been introduced in 1.7 or 1.7.5 ...
What cluster spec are you using? Enough people are using this that we must just be really lucky. We have had few issues specifically with this feature. Regardless, we need to either warn users or force them to use reasonable times. Any ideas what is happening? I am guessing that drain and validate is moving too fast, before the master is actually up.
@justinsb a cluster spec would be helpful. I do have a hypothesis.
@blakebarnett @gambol99 @KashifSaadat @alrs @sethpollack Can anyone recommend a good pattern for detecting flapping services? I think the following services should not be flapping before we consider a master node properly upgraded.
My recommendation is that we confirm none of the 5 components is flapping before we proceed to the next master, or to the nodes.
Here is an idea: http://linuxczar.net/blog/2016/01/31/flap-detection/ Warning: math ... ugh, math
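To make the flapping idea concrete, here is a minimal sketch. It is a simplified, unweighted version of the general technique (count state changes across a window of recent health checks), not necessarily the exact math from the linked post, and not kops code. The function names and the threshold are assumptions.

```go
package main

import "fmt"

// flapPercent returns the percentage of adjacent observations that differ.
// history holds recent health-check results, oldest first (true = healthy).
// Fewer than two observations yields 0.
func flapPercent(history []bool) float64 {
	if len(history) < 2 {
		return 0
	}
	changes := 0
	for i := 1; i < len(history); i++ {
		if history[i] != history[i-1] {
			changes++
		}
	}
	return 100 * float64(changes) / float64(len(history)-1)
}

// isStable reports whether a component is healthy right now and its
// recent history changed state no more than maxFlapPercent of the time.
func isStable(history []bool, maxFlapPercent float64) bool {
	if len(history) == 0 || !history[len(history)-1] {
		return false
	}
	return flapPercent(history) <= maxFlapPercent
}

func main() {
	steady := []bool{true, true, true, true, true}
	flappy := []bool{true, false, true, false, true}
	fmt.Println(isStable(steady, 20), isStable(flappy, 20))
}
```

A rolling update would then only move on once `isStable` holds for every component being watched on the master, rather than on the first successful check.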
Ran a test with a 2m interval; here are the results: https://gist.github.com/chrislovecnm/9cf1d2644851aa04ed40dc23fce92c31 Looks great :) Let's test the insane 2s interval.
Here are the results with a 2s interval: https://gist.github.com/chrislovecnm/f5dd79d20724d54f787673563cc1ff75 It fails as expected, because it exceeds the default number of validation attempts. We only try to validate so many times for each node. As documented:
@justinsb how can I break the code the way you are? The 2s interval causes the validation to time out, as expected.
Automatic merge from submit-queue. Promoting drain and validate by setting feature flag to true. I am unable to recreate #2407, and frankly, it may be an edge case. We could warn a user if their wait times are low, but that would be another PR. This PR moves the Drain and Validate functionality for rolling updates into the default user experience by setting the feature flag to true. Per feedback, I am using the node and master interval times for the validation.
closing this ... non-issue
I need to test this. This issue is terse because it is a note for me ;)
Once I get this closed I will flip the feature flag.