promoting drain and validate by setting feature flag to true #3329
Conversation
lgtm ...

Drain and validate are still broken for me with low intervals. I'll verify and post output.

@justinsb should we force reasonable intervals or tell the user to do cloud-only? Drain and validate is not designed for low intervals. In my opinion, doing an update in production is not something that should be super fast.

@justinsb I have completed some testing tonight.

We should warn if a user sets the interval under 3 minutes, but I am unable to recreate the problems you are encountering. What applications do you have installed? Am I noticing Prometheus? Can we set up a time to work through this? I have done a lot of testing with this code, and I know other people are using it in production. How do we proceed?

So I run with a 2-second interval because I don't want to miss any edge cases. Compared to a 2m interval, it's like running 60x more tests! We discussed this on slack:

What do you think?

Another option is to treat

I forgot to create an issue with details, but randomly I have seen rolling-update choose to do the nodes first. I'm completely baffled by how this can happen based on the code, but it has definitely happened at least twice for me, once on a production cluster. If I hadn't been paying close attention it probably would have caused a major outage.
Force-pushed from be9136c to acb5e8b
export KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate"
# do not fail if the cluster does not validate
# wait 8 min to create new node, and at least 8 min
# to validate the cluster.
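A minimal sketch of enabling the behavior before this PR lands (the export becomes unnecessary once the flag defaults to true). The `--master-interval`/`--node-interval` invocation shown in the comment is illustrative of the "generous intervals" advice in this thread, not a prescribed command:

```shell
# Turn the flag on explicitly; this PR makes it the default.
export KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate"
echo "$KOPS_FEATURE_FLAGS"

# Then roll with generous intervals, e.g.:
#   kops rolling-update cluster --master-interval=8m --node-interval=8m --yes
```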
What's going on with these indents? Not a merge blocker, but is this deliberate?
for i := 0; i <= rollingUpdateData.ValidateRetries; i++ {
// ValidateClusterWithDuration runs validation.ValidateCluster until either we get a positive result or the timeout expires
func (n *CloudInstanceGroup) ValidateClusterWithDuration(rollingUpdateData *RollingUpdateCluster, instanceGroupList *api.InstanceGroupList, duration time.Duration) error {
// TODO should we expose this to the UI?
Probably once a use-case is demonstrated for doing so, but until then, no :-)
select {
case <-timeout:
// Got a timeout fail with a timeout error
return fmt.Errorf("cluster did not validate within a duation of %q", duration)
typo: duation
If rolling-update does not report that the cluster needs to be rolled you can force the cluster to be
rolled with the force flag. Rolling update drains and validates the cluster by default. A cluster is
deemed validated when all required nodes are running, and all pods in the kube-system namespace are operational.
When a node is deleted rolling-update sleeps the interval for the node type, and the tries for the same period
typo: s/the/then
LGTM 🎉
return fmt.Errorf("cluster validation failed: %v", err)
func (n *CloudInstanceGroup) tryValidateCluster(rollingUpdateData *RollingUpdateCluster, instanceGroupList *api.InstanceGroupList, duration time.Duration, tickDuration time.Duration) bool {
if _, err := validation.ValidateCluster(rollingUpdateData.ClusterName, instanceGroupList, rollingUpdateData.K8sClient); err != nil {
glog.Infof("Cluster did not validate, will try again in %q util duration %q expires: %v.", tickDuration, duration, err)
typo: util
/lgtm

/test all [submit-queue is verifying that this PR is safe to merge]

I will file an issue for the typos, and the indenting is removed when the markdown is generated.

/lgtm cancel // PR changed after LGTM, removing LGTM. @chrislovecnm @justinsb

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: justinsb. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files.
You can indicate your approval by writing

/test all [submit-queue is verifying that this PR is safe to merge]

Automatic merge from submit-queue.
I am unable to recreate #2407, and frankly, it may be an edge case. We could warn a user if their wait times are low, but that would be another PR.
This PR moves the drain-and-validate functionality for rolling updates into the default user experience, setting the feature flag to true.
Per feedback, I am using the node and master interval times for the validation.