Fix situation when scale-up overshoots 33% broken nodes while hitting quota #547
Comments
I've experimented a bit and I think it should be enough to remove old unregistered nodes before checking cluster health. This way, when we want to scale up a node group more than 3x over quota, we will have a failed scale-up after 15 minutes (this is when we decide that the nodes will not register themselves), and the node group will back off from scaling up, but the cluster will stay healthy and the autoscaler will keep trying to scale up/down.
It would be great to also add more messaging around this (AFAIK right now there is no explicit suggestion for the user to check quota).
I'm not convinced this solves all the problems. Aren't we going to be stuck in a broken state for 15 minutes after a failed scale-up? Maybe we shouldn't count longUnregistered nodes towards the cluster being unready for autoscaling? I'm not sure; just wondering if there is any good reason for doing it. I added it myself when I added tracking of longUnregistered nodes, but I'm not convinced it should be that way. IIRC I did it because it seemed intuitively logical to count them as broken nodes, but I don't think I thought that part out too deeply. I still haven't, so if you think I'm wrong I'm ok with your solution.

===

The latter part of this comment is my thoughts on how we should rethink our handling of clusters with multiple nodes in a bad state. This would be a larger effort and we shouldn't wait for it with fixing this issue. So I'm happy with any tactical fix that solves the problem at hand.

===

Taking a step back. The backoff when too many nodes are broken was added a long time ago. I suspect the logic behind it was along the lines of "CA is in early beta and we don't think it can handle large-scale cluster failures gracefully; let's just avoid causing more damage" (am I right @mwielgus?). Anyway, it was before we considered things like quotas and multizone. Maybe we should rethink whether it still makes sense. Consider a total zone failure scenario: in a 3-zone cluster, statistically 1/3 of the nodes will become unready, causing CA to back off just when it is most needed. Maybe we should approach it more methodically and list the different node failure scenarios and how we want CA to handle each of them. I suspect we may find out that we should move away from total backoff to per-nodegroup handling, similar to how we currently back off a nodegroup after a failed scale-up.
Re "Aren't we going to be stuck in a broken state for 15 minutes after a failed scale-up": the cluster never actually goes into the Unhealthy state. After a failed scale-up, we remove the unregistered nodes almost immediately (that's what I observed). The nodes are upcoming for 15 minutes, and this counts towards the node provisioning time (counted from the moment the node is first seen), which is also 15 minutes by default, so both timeouts expire at essentially the same moment.
I think it makes sense to count the nodes that we fail to delete towards unhealthy nodes in the cluster, because if we don't delete them, the Instance Group is in a funny state (I'm not entirely sure it's easily recoverable). It should be infrequent and unexpected enough for the deletion operation to fail, and we will subsequently retry deleting them, so even if the cluster is in an unhealthy state for a while because of undeletable unregistered nodes, we will be able to come out of it if the failure is transient. WDYT?

===

I was thinking about the multi-zone scenario too, and it is definitely a motivation to revisit this. I agree that in most (all?) cases it makes more sense to back off individual nodegroups than the whole cluster. I also like the idea of approaching this methodically because it would give us a chance to communicate the state of the cluster from the autoscaling perspective more clearly to the users (as you mentioned in relation to the situation happening in this issue).
Right, I forgot both timeouts are the same. Let's go with your solution as a fix for this, so we can include it in the next patch release. And let's revisit the general case afterwards.