remove nodes stuck creating even if this takes us below nodeGroup minSize #4357
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR is an attempt to resolve #4351. The AWS cloudprovider doesn't currently set the
InstanceStatus
field oncloudprovider.Instance
, so in this change we start populating this field. We set it toInstanceRunning
except when the node is a placeholder node, in which case we set it toInstanceCreating
. Then, in theremoveOldUnregisteredNodes
function, if removing the node would take us below the node group'sminSize
, but the status is "creating", we go ahead and remove the node anyways. The logic here is that if the node is stuck in creating for 15 minutes (or whatever the max node provision time is), we're already below the node group's minimum size, but CA just doesn't think it is.I've added a case to the
TestStaticAutoscalerRunOnceWithLongUnregisteredNodes
test case to confirm that this works as expected.I think this is a relatively simple/safe change. My only slight concern is that I'm not sure how other cloud providers that use the
InstanceStatus
field will respond to this change. It looks like the other cloud providers that use theInstanceCreating
value are:I don't know a good way to test the behaviour of these other cloud providers. :\
The other thing I was unsure about was what to do in the
ClusterStateRegistry
when an instance changes state. My intuition was that if a node transitions from (say) "creating" to "running", that it would make sense to reset the node's "unregistered-since" timer, but since we currently don't track the state transitions at all, this would be a change from the existing behaviour that might be unexpected.