
Controller is not scaling-up degraded control plane #352

Open
zioc opened this issue Jun 25, 2024 · 3 comments
Labels
  • kind/bug: Something isn't working
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments


zioc commented Jun 25, 2024

What happened:

On a degraded cluster, 2 control-plane nodes out of 3 were unhealthy. One CP machine was deleted, but it was not re-created by the control plane controller; we were observing the following logs:

Scaling up control plane" "Desired"=3 "Existing"=2 "RKE2ControlPlane

But right after that, we can see that the control plane was not scaled up because of the following check:

"Waiting for control plane to pass preflight checks" [...] "failures"="machine management-cluster-control-plane-dklp7 reports AgentHealthy condition is false (Error, Missing node)"

Is this on purpose? Wouldn't it be legitimate to scale up the control plane anyway in such cases? Even if some node is not healthy, wouldn't it be worth creating a new machine to match the requested number of replicas?
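
For context, here is a minimal sketch (in Go, with hypothetical type and function names, not the actual CAPRKE2 source) of the kind of gate that seems to produce this behaviour: every existing control-plane machine must be healthy before a new one is created, so a single machine with a missing node blocks the scale-up indefinitely:

```go
package main

import "fmt"

// machineStatus is a simplified, hypothetical stand-in for a CAPI Machine
// and the conditions the controller inspects.
type machineStatus struct {
	Name         string
	AgentHealthy bool
	Reason       string
}

// preflightChecks mirrors the observed behaviour: if any existing
// control-plane machine reports an unhealthy condition, scale-up is blocked
// entirely and the controller requeues instead of creating the missing machine.
func preflightChecks(machines []machineStatus) []string {
	var failures []string
	for _, m := range machines {
		if !m.AgentHealthy {
			failures = append(failures, fmt.Sprintf(
				"machine %s reports AgentHealthy condition is false (%s)", m.Name, m.Reason))
		}
	}
	return failures
}

func main() {
	// Desired replicas = 3, existing = 2, one of which has lost its node.
	machines := []machineStatus{
		{Name: "management-cluster-control-plane-dklp7", AgentHealthy: false, Reason: "Error, Missing node"},
		{Name: "management-cluster-control-plane-healthy", AgentHealthy: true},
	}
	if failures := preflightChecks(machines); len(failures) > 0 {
		// This is the branch we keep hitting: "Scaling up control plane" is
		// logged, but the replacement machine is never generated.
		fmt.Println("Waiting for control plane to pass preflight checks:", failures)
		return
	}
	fmt.Println("Preflight checks passed, generating new machine")
}
```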

Here is a more complete log; we can see that a machine was generated as soon as the second unhealthy CP node was deleted (at the end):

[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:45:24.724005       1 rke2controlplane_controller.go:387]  "msg"="Reconcile RKE2 Control Plane" "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="fb967ad8-a26d-4431-bef7-5ae459a10cde"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:45:24.775706       1 rke2controlplane_controller.go:510]  "msg"="Scaling up control plane" "Desired"=3 "Existing"=2 "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="fb967ad8-a26d-4431-bef7-5ae459a10cde"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:45:24.775772       1 scale.go:225]  "msg"="Waiting for control plane to pass preflight checks" "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "failures"="machine management-cluster-control-plane-dklp7 reports AgentHealthy condition is false (Error, Missing node)" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="fb967ad8-a26d-4431-bef7-5ae459a10cde"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:01.000267       1 rke2controlplane_controller.go:387]  "msg"="Reconcile RKE2 Control Plane" "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="c3a6ad0b-bfd1-4d74-99ad-62e6c7b94e5b"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:01.044557       1 rke2controlplane_controller.go:510]  "msg"="Scaling up control plane" "Desired"=3 "Existing"=2 "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="c3a6ad0b-bfd1-4d74-99ad-62e6c7b94e5b"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:01.045034       1 scale.go:225]  "msg"="Waiting for control plane to pass preflight checks" "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "failures"="machine management-cluster-control-plane-dklp7 reports AgentHealthy condition is false (Error, Missing node)" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="c3a6ad0b-bfd1-4d74-99ad-62e6c7b94e5b"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:39.960985       1 rke2controlplane_controller.go:387]  "msg"="Reconcile RKE2 Control Plane" "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="b6992d19-b1aa-4ea3-91e4-382a84f512d0"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:40.009671       1 rke2controlplane_controller.go:510]  "msg"="Scaling up control plane" "Desired"=3 "Existing"=2 "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="b6992d19-b1aa-4ea3-91e4-382a84f512d0"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:40.084152       1 scale.go:402]  "msg"="Version checking..." "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "machine-version: "="1.28.8" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="b6992d19-b1aa-4ea3-91e4-382a84f512d0" "rke2-version"="v1.28.8+rke2r1"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:40.084188       1 scale.go:425]  "msg"="generating machine:" "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "machine-spec-version"="1.28.8" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="b6992d19-b1aa-4ea3-91e4-382a84f512d0"
[rke2-control-plane-controller-manager-588d666c5d-sjgnn] I0624 14:46:40.156976       1 rke2controlplane_controller.go:387]  "msg"="Reconcile RKE2 Control Plane" "RKE2ControlPlane"={"name":"management-cluster-control-plane","namespace":"sylva-system"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="management-cluster-control-plane" "namespace"="sylva-system" "reconcileID"="2040b8a7-d273-458c-805a-cf45dc0b2e57"

Environment:

  • sylva 1.1.0, rke2 + capo
  • rke2 provider version: v0.2.7
zioc added the kind/bug, needs-priority and needs-triage labels on Jun 25, 2024
dhanabal1 commented

Yes, I also observed the same issue. Why do we need to check all the master nodes' health statuses? If we have 1 healthy master node, wouldn't it be good to proceed with scaling up the next node? This check also blocks parallel node provisioning, adding delay when there are more masters in the cluster. (See the sketch below for the kind of relaxation I mean.)
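
To illustrate the idea, a hypothetical relaxation could block scale-up only when no healthy control-plane machine is left (reusing the machineStatus type from the sketch in the issue description; this is not an actual CAPRKE2 patch):

```go
// preflightChecksRelaxed allows scale-up as long as at least one
// control-plane machine is healthy, so a replacement machine can be
// provisioned in parallel with remediation of the unhealthy ones instead
// of waiting for them serially.
func preflightChecksRelaxed(machines []machineStatus) []string {
	healthy := 0
	for _, m := range machines {
		if m.AgentHealthy {
			healthy++
		}
	}
	if healthy == 0 {
		return []string{"no healthy control plane machine available"}
	}
	return nil
}
```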

alexander-demicev added the priority/important-soon and triage/accepted labels and removed the needs-priority and needs-triage labels on Aug 27, 2024
@alexander-demicev
Member

@zioc Can you check if the issue still appears on the latest version of CAPRKE2?

github-actions bot commented

This issue is stale because it has been open 90 days with no activity.

github-actions bot added the lifecycle/stale label on Nov 26, 2024