Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ClusterAPI Provider: Provide fake proivder IDs for failed Machines #2983

Merged
merged 2 commits into from
Apr 14, 2020

Conversation

JoelSpeed
Copy link
Contributor

This is based on the spot instance handling from the AWS provider (#2235).

Some ClusterAPI implementations will mark a Machine as failed if there is a problem creating the instance on the cloud provider. In this case, these instances weren't being considered as bad and the autoscaler considered them to be comingUp. With this patch in, after the maxNodeProvisionTime is reached, the autoscaler marks the nodegroup as unhealthy, scales it down (removing the unhealthy machine), and then tries to scale up an alternate group, matching the behaviour seen with the AWS provider once that was patched

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 26, 2020
@k8s-ci-robot k8s-ci-robot requested review from frobware and ncdc March 26, 2020 14:41
@JoelSpeed JoelSpeed force-pushed the simulate-failed-machines branch from 0fb8c6c to 70ac43c Compare March 26, 2020 18:36
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 26, 2020
@JoelSpeed JoelSpeed force-pushed the simulate-failed-machines branch 2 times, most recently from 1f9b472 to eaeabe2 Compare March 27, 2020 11:19
@@ -422,6 +443,15 @@ func (c *machineController) machineSetProviderIDs(machineSet *MachineSet) ([]str
continue
}

if machine.Status.Phase == "Failed" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be compare against ErrorMessage instead?

@enxebre
Copy link
Member

enxebre commented Mar 27, 2020

/area provider/cluster-api

@k8s-ci-robot k8s-ci-robot added the area/provider/cluster-api Issues or PRs related to Cluster API provider label Mar 27, 2020
@JoelSpeed JoelSpeed force-pushed the simulate-failed-machines branch 2 times, most recently from 82ba453 to b0b20a1 Compare March 30, 2020 12:28
@enxebre
Copy link
Member

enxebre commented Apr 2, 2020

nit: do you think we can make conflate Use FailureMessage instead of Phase to determine failed Machines with the original commit? Tracking that change seems meaningless since it never made it in.

@@ -422,6 +443,15 @@ func (c *machineController) machineSetProviderIDs(machineSet *MachineSet) ([]str
continue
}

if machine.Status.FailureMessage != nil {
klog.V(4).Infof("Status.FailureMessage of machine %q is %q", machine.Name, *machine.Status.FailureMessage)
// Provide a fake ID that can be recognised later and converted into a machine key.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: may be add the "why" i.e so the machine is tracked by the nodegroup and the maxNodeProvisionTime can kicks in

@JoelSpeed JoelSpeed force-pushed the simulate-failed-machines branch from b0b20a1 to d23d3a1 Compare April 2, 2020 14:25
@JoelSpeed
Copy link
Contributor Author

@enxebre I addressed your feedback, PTAL

Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 14, 2020
@enxebre
Copy link
Member

enxebre commented Apr 14, 2020

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 14, 2020
@k8s-ci-robot k8s-ci-robot merged commit 72b08db into kubernetes:master Apr 14, 2020
k8s-ci-robot added a commit that referenced this pull request Jun 3, 2020
[CA-1.18] #2983 cherry-pick: ClusterAPI Provider: Provide fake proivder IDs for failed Machines
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/cluster-api Issues or PRs related to Cluster API provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants