
unhealthyMachineTimeout not working when VM is powered off (VM not deleted from disk) #8785

Open

saiteja313 (Contributor) opened this issue Sep 17, 2024 · 0 comments

What happened:

I created an EKS Anywhere (EKS-A) cluster with the following configuration:

  1. unhealthyMachineTimeout set to 30 seconds (the minimum value) in the worker node section of the cluster config file
  2. Autoscaling enabled for worker nodes in the cluster config file
  3. Cluster Autoscaler curated package installed on the cluster

After cluster creation, I tested two scenarios:

  1. Scenario 1: In the VMware vSphere console, right-click one of the worker node VMs and select Power Off
  2. Scenario 2: Right-click one of the worker node VMs, select Power Off, then right-click again and select Delete from Disk

Scenario 1 fails every time. No new node is created, and the CAPV pod logs show no unhealthy-node event for 4-5 minutes. After that, the node is either deleted and a new node provisioned, or the node is powered back on.

Scenario 2 works every time. After the node is deleted, a new node is provisioned within 30 seconds.

[1] https://anywhere.eks.amazonaws.com/docs/getting-started/optional/healthchecks/#__machinehealthcheckunhealthymachinetimeout__-optional

What you expected to happen:

For scenario 1, CAPV should respect the 30-second unhealthyMachineTimeout value. Even when unhealthyMachineTimeout is set to 5 minutes, CAPV takes around 20-40 minutes to realize the node is powered off or NotReady.

I am not sure whether we need something like the node termination handler that Amazon EKS in the cloud has.
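Part of the delay in scenario 1 may come from the fact that the MachineHealthCheck timer only starts once the Node's Ready condition flips to Unknown, which itself waits on the control plane's node-monitor-grace-period. A back-of-envelope sketch of the expected timings (the constants below are commonly cited Kubernetes defaults I am assuming, not values measured from this cluster):

```python
# Back-of-envelope model of remediation latency (assumed defaults, not measured).
NODE_STATUS_UPDATE_FREQUENCY = 10   # s, assumed kubelet heartbeat interval
NODE_MONITOR_GRACE_PERIOD = 40      # s, assumed delay before Ready flips to Unknown
UNHEALTHY_MACHINE_TIMEOUT = 30      # s, value set in the cluster spec above

def powered_off_detection_seconds() -> int:
    """Powered-off VM (scenario 1): the unhealthyMachineTimeout clock only
    starts after the Node's Ready condition becomes Unknown."""
    return NODE_STATUS_UPDATE_FREQUENCY + NODE_MONITOR_GRACE_PERIOD + UNHEALTHY_MACHINE_TIMEOUT

def deleted_vm_detection_seconds() -> int:
    """Deleted VM (scenario 2): only the unhealthyMachineTimeout applies,
    matching the ~30 s replacement observed."""
    return UNHEALTHY_MACHINE_TIMEOUT

print(powered_off_detection_seconds())  # 80
print(deleted_vm_detection_seconds())   # 30
```

Even under these assumptions, scenario 1 should resolve in roughly 80 seconds, so the observed 4-5 minute (or 20-40 minute) delay suggests something beyond the grace period is at play.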

How to reproduce it (as minimally and precisely as possible):

  • Configure the worker node section of the cluster config file as follows:

```yaml
workerNodeGroupConfigurations:
- count: 1
  machineGroupRef:
    kind: VSphereMachineConfig
    name: demo-mgmt
  name: md-0
  autoscalingConfiguration:
    minCount: 1
    maxCount: 5
  machineHealthCheck:
    unhealthyMachineTimeout: 30s
    maxUnhealthy: 100%
```
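For context, the machineHealthCheck settings above are rendered by EKS Anywhere into a Cluster API MachineHealthCheck object on the management cluster. A rough sketch of what the generated object might look like (the object name, namespace, and selector labels here are assumptions for illustration, not taken from an actual cluster):

```yaml
# Illustrative sketch only; metadata and selector values are assumptions.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: demo-mgmt-md-0-worker-unhealthy   # hypothetical name
  namespace: eksa-system
spec:
  clusterName: demo-mgmt
  maxUnhealthy: 100%
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: demo-mgmt-md-0
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 30s
  - type: Ready
    status: "False"
    timeout: 30s
```

Because the unhealthyConditions watch the Node's Ready condition, a VM that is powered off but still registered as a Node is only remediated after that condition changes, whereas a deleted VM is detected by CAPV directly.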

Anything else we need to know?:

Environment: EKS Anywhere with vSphere

  • EKS Anywhere Release: 0.20 (v0.20.4)
  • Release Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml
  • Bundle Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/74/manifest.yaml
  • EKS Distro Release: not sure
@saiteja313 saiteja313 changed the title unhealthyMachineTimeout not working when VM is powered off and VM not deleted from the disk unhealthyMachineTimeout not working when VM is powered off (VM not deleted from disk) Sep 17, 2024
@vivek-koppuru vivek-koppuru added this to the v0.21.0 milestone Sep 18, 2024
@vivek-koppuru vivek-koppuru modified the milestones: v0.21.0, v0.22.0 Dec 13, 2024