
Fix situation when scale up overshoots 33% broken nodes while hitting quota #547

Closed
mwielgus opened this issue Jan 15, 2018 · 4 comments

Comments

@mwielgus (Contributor)

No description provided.

@bskiba (Member) commented Feb 1, 2018

I've experimented a bit and I think it should be enough to remove old unregistered nodes before checking cluster health (rough sketch at the end of this comment).

This way, when we want to scale a node group up more than 3x over quota, we will have a failed scale-up after 15 minutes (this is when we decide that the nodes will not register themselves), the node group will back off from scaling up, but the cluster will stay healthy and the autoscaler will keep trying to scale up/down.

This means that:

  1. If the user removes the load from the cluster, the autoscaler will be able to notice, stop trying to scale up and, if needed, scale the cluster down, with no additional action from the user.
  2. If the user raises the quota to accommodate the load, CA will try scaling the node group up again and succeed (I have yet to confirm this one on a test cluster).

@MaciekPytel, let me know if you think I'm missing something here.

It would be great to also add more messaging around this (AFAIK right now there is no explicit suggestion for the user to check quota).
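To make the idea concrete, here is a minimal, self-contained sketch of the ordering I have in mind. The names (`ClusterState`, `removeOldUnregisteredNodes`, `isClusterHealthy`) and the 15-minute constant are simplified assumptions for illustration, not the actual cluster-autoscaler code:

```go
// Sketch of the proposed ordering inside the main autoscaler loop.
// All names here are illustrative, not the real cluster-autoscaler API.
package main

import (
	"fmt"
	"time"
)

const unregisteredNodeRemovalTimeout = 15 * time.Minute

type UnregisteredNode struct {
	Name      string
	FirstSeen time.Time
}

type ClusterState struct {
	ReadyNodes        int
	UnreadyNodes      int
	UnregisteredNodes []UnregisteredNode
}

// removeOldUnregisteredNodes drops nodes that never registered within the
// timeout, so they no longer count against cluster health.
func removeOldUnregisteredNodes(cs *ClusterState, now time.Time) {
	kept := cs.UnregisteredNodes[:0]
	for _, n := range cs.UnregisteredNodes {
		if now.Sub(n.FirstSeen) < unregisteredNodeRemovalTimeout {
			kept = append(kept, n)
		} else {
			fmt.Printf("removing long-unregistered node %s\n", n.Name)
		}
	}
	cs.UnregisteredNodes = kept
}

// isClusterHealthy applies the "at most 1/3 broken nodes" rule behind the
// global backoff discussed in this issue.
func isClusterHealthy(cs *ClusterState) bool {
	broken := cs.UnreadyNodes + len(cs.UnregisteredNodes)
	total := cs.ReadyNodes + broken
	return total == 0 || 3*broken <= total
}

func runOnce(cs *ClusterState, now time.Time) {
	// Proposed fix: clean up long-unregistered nodes *before* the health
	// check, so a failed scale-up over quota does not flip the whole
	// cluster into an unhealthy state.
	removeOldUnregisteredNodes(cs, now)

	if !isClusterHealthy(cs) {
		fmt.Println("cluster unhealthy, skipping scale up/down")
		return
	}
	fmt.Println("cluster healthy, continuing with scale up/down logic")
}

func main() {
	cs := &ClusterState{
		ReadyNodes: 4,
		UnregisteredNodes: []UnregisteredNode{
			{Name: "stuck-node-1", FirstSeen: time.Now().Add(-20 * time.Minute)},
			{Name: "stuck-node-2", FirstSeen: time.Now().Add(-20 * time.Minute)},
			{Name: "stuck-node-3", FirstSeen: time.Now().Add(-20 * time.Minute)},
		},
	}
	runOnce(cs, time.Now())
}
```

With the removal done first, the three stuck nodes are dropped before the health check and the cluster stays healthy; with the old ordering they would count as 3 of 7 broken nodes and trip the 33% threshold.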

@MaciekPytel (Contributor)

I'm not convinced this solves all the problems. Aren't we going to be stuck in a broken state for 15 minutes after a failed scale-up? Maybe we shouldn't count longUnregistered nodes towards the cluster being unready for autoscaling? I'm not sure, just wondering if there is any good reason for doing it. I added it myself when I added tracking of longUnregistered nodes, but I'm not convinced it should be that way. IIRC I did it because it seemed intuitively logical to count them as broken nodes, but I don't think I thought that part out too deeply. I still haven't, so if you think I'm wrong I'm OK with your solution.

===

The latter part of this comment is my thoughts on how we should rethink our handling of clusters with multiple nodes in a bad state. That would be a larger effort and we shouldn't hold up fixing this issue for it, so I'm happy with any tactical fix that solves the problem at hand.

===

Taking a step back: the backoff when too many nodes are broken was added a long time ago. I suspect the logic behind it was along the lines of "CA is in early beta and we don't think it can handle large-scale cluster failures gracefully; let's just avoid causing more damage" (am I right @mwielgus?). In any case, it predates our thinking about things like quotas and multizone clusters, so maybe we should rethink whether it still makes sense. Consider a total zone failure: in a 3-zone cluster, roughly 1/3 of the nodes will become unready, triggering the CA backoff just when it is most needed.

Maybe we should approach it more methodically and list the different node failure scenarios and how we want CA to handle each of them. I suspect we may find that we should move away from a total backoff to per-nodegroup handling, similar to how we currently back off a node group after a failed scale-up.
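As a hypothetical illustration of what per-nodegroup handling could look like (the types, field names and thresholds below are assumptions, not the existing code):

```go
// Rough sketch of per-nodegroup backoff instead of a global one.
package main

import (
	"fmt"
	"time"
)

type NodeGroupHealth struct {
	BackoffUntil time.Time
	UnreadyNodes int
	TotalNodes   int
}

// canScaleUp backs off only the affected node group, so a zone failure (or a
// quota problem in one group) does not block autoscaling everywhere else.
func canScaleUp(h NodeGroupHealth, now time.Time) bool {
	if now.Before(h.BackoffUntil) {
		return false
	}
	// Same 1/3-broken rule, but applied per node group rather than cluster-wide.
	return h.TotalNodes == 0 || 3*h.UnreadyNodes <= h.TotalNodes
}

func main() {
	groups := map[string]NodeGroupHealth{
		"zone-a": {UnreadyNodes: 5, TotalNodes: 5}, // total zone failure
		"zone-b": {UnreadyNodes: 0, TotalNodes: 5},
		"zone-c": {UnreadyNodes: 0, TotalNodes: 5},
	}
	for name, h := range groups {
		fmt.Printf("%s: can scale up = %v\n", name, canScaleUp(h, time.Now()))
	}
}
```

In this sketch only zone-a is backed off, while zone-b and zone-c keep scaling, which is the behaviour I think we want in the multizone failure case.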

WDYT @mwielgus @bskiba @aleksandra-malinowska?

@bskiba (Member) commented Feb 1, 2018

Re "Aren't we going to be stuck in broken state for 15 minutes after a failed scale-up": the cluster never actually goes into Unhealthy state. After a failed scale up, we remove the unregistered nodes almost immediately (that's what I observed), nodes are upcoming for 15 minutes and this counts towards the node provisioning time (counted from the moment the node is first seen), which is also by default 15 minutes.

So it goes like this (rough timing sketch after the list):

  1. Nodes are upcoming for ~15 minutes.
  2. We deem the scale-up failed, emit an event and back off the node group.
  3. The nodes we tried to create have now been unregistered for 15 minutes, so they are removed.
  4. After the backoff period, we try to scale up again and go back to 1.
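A small sketch of why the timing works out; the constant names are illustrative assumptions, but both defaults are 15 minutes as described above:

```go
// Both the "scale-up failed" decision and the removal of unregistered nodes
// key off the same 15-minute default, measured from when the node is first seen.
package main

import (
	"fmt"
	"time"
)

const (
	maxNodeProvisionTime           = 15 * time.Minute // node counts as "upcoming" until this elapses
	unregisteredNodeRemovalTimeout = 15 * time.Minute // after this, the unregistered node is deleted
)

func main() {
	firstSeen := time.Now().Add(-16 * time.Minute)
	age := time.Since(firstSeen)

	scaleUpFailed := age > maxNodeProvisionTime
	nodeRemoved := age > unregisteredNodeRemovalTimeout

	// Because the two timeouts are equal, the moment we declare the scale-up
	// failed we also remove the node, so it never lingers long enough to push
	// the cluster into an Unhealthy state.
	fmt.Printf("scale-up failed: %v, unregistered node removed: %v\n",
		scaleUpFailed, nodeRemoved)
}
```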

I think it makes sense to count the nodes that we fail to delete towards unhealthy nodes in the cluster, because if we don't delete them the Instance Group is left in a funny state (I'm not entirely sure it's easily recoverable). Deletion failing should be infrequent and unexpected enough, and we will subsequently retry deleting the nodes, so even if the cluster is in an unhealthy state for a while because of undeletable unregistered nodes, we will be able to come out of it if the failure is transient. WDYT?

===

I was thinking about the multi-zone scenario too and it is definitely a motivation to revisit this. I agree that in most (all?) cases it makes more sense to back off individual node groups rather than the whole cluster. I also like the idea of approaching this methodically because it would give us a chance to communicate the state of the cluster from the autoscaling perspective to users more clearly (as you mentioned in relation to the situation in this issue).

@MaciekPytel (Contributor)

Right, I forgot both timeouts are the same. Let's go with your solution as a fix for this, so we can include it in the next patch release, and let's revisit the general case afterwards.
