Fix situation when scale-up overshoots 33% broken nodes while hitting quota #547
Comments
I've experimented a bit and I think it should be enough to remove old unregistered nodes before checking cluster health. This way, when we want to scale up a node group more than 3x over quota, we will have a failed scale-up after 15 minutes (this is when we decide that the nodes will not register themselves), and the node group will back off from scaling up, but the cluster will stay healthy and the autoscaler will keep trying to scale up/down.
It would be great to also add more messaging around this (AFAIK right now there is no explicit suggestion for the user to check quota).
I'm not convinced this solves all the problems. Aren't we going to be stuck in a broken state for 15 minutes after a failed scale-up? Maybe we shouldn't count longUnregistered nodes towards the cluster being unready for autoscaling? I'm not sure; just wondering if there is any good reason for doing it. I added it myself when I added tracking of longUnregistered nodes, but I'm not convinced it should be that way. IIRC I did it because it seemed intuitively logical to count them as broken nodes, but I don't think I thought that part out too deeply. I still haven't, so if you think I'm wrong I'm ok with your solution.

===

The latter part of this comment is my thoughts on how we should rethink our handling of clusters with multiple nodes in a bad state. This would be a larger effort and we shouldn't wait for it with fixing this issue. So I'm happy with any tactical fix that solves the problem at hand.

===

Taking a step back. The backoff when too many nodes are broken was added a long time ago. I suspect the logic behind it was along the lines of "CA is in early beta and we don't think it can handle large-scale cluster failures gracefully; let's just avoid causing more damage" (am I right @mwielgus?). Anyway, it was before we considered things like quotas and multizone. Maybe we should rethink whether it still makes sense. Consider a total zone failure scenario: in a 3-zone cluster, statistically 1/3 of the nodes will become unready, causing CA to back off just when it is most needed. Maybe we should approach it more methodically and list the different node failure scenarios and how we want CA to handle each of them. I suspect we may find out that we should move away from total backoff to per-nodegroup handling, similar to how we currently back off a nodegroup after a failed scale-up.
Re "Aren't we going to be stuck in a broken state for 15 minutes after a failed scale-up": the cluster never actually goes into the Unhealthy state. After a failed scale-up, we remove the unregistered nodes almost immediately (that's what I observed). The nodes are upcoming for 15 minutes, and this counts towards the node provisioning time (counted from the moment the node is first seen), which is also 15 minutes by default, so both timeouts expire at essentially the same moment.
I think it makes sense to count the nodes that we fail to delete towards unhealthy nodes in the cluster, because if we don't delete them, the Instance Group is in a funny state (I'm not entirely sure it's easily recoverable). It should be infrequent and unexpected enough for the deletion operation to fail, and we will subsequently retry deleting them, so even if the cluster is in an unhealthy state for a while because of undeletable unregistered nodes, we will be able to come out of it if the failure is transient. WDYT?

===

I was thinking about the multi-zone scenario too, and it is definitely a motivation to revisit this. I agree that in most (all?) cases it makes more sense to back off individual nodegroups than the whole cluster. I also like the idea of approaching this methodically because it would give us a chance to communicate the state of the cluster from the autoscaling perspective more clearly to the users (as you mentioned in relation to the situation happening in this issue).
Right, I forgot both timeouts are the same. Let's go with your solution as a fix for this, so we can include it in the next patch release. And let's revisit the general case afterwards.