Race in deleting cluster secret prevents deletion of NutanixCluster object when deleting CAPI cluster object #281

Open
prajnutanix opened this issue May 12, 2023 · 1 comment

Comments

@prajnutanix

/kind bug

What steps did you take and what happened:

  1. Create a kind cluster to serve as the management cluster for this repro. Run clusterctl init --infrastructure nutanix:v1.2.1 to initialize the providers. Before running this command, make sure ~/.cluster-api/clusterctl.yaml contains the values needed for cluster initialization.
  2. Use clusterctl to generate a cluster YAML for the CAPX provider. For this report, let the CAPI cluster referenced by the cluster.x-k8s.io/cluster-name label be named capx-cluster. Point both the worker MachineDeployment (MD) and the KubeadmControlPlane (KCP) at the same NutanixMachineTemplate, say capx-cluster-mt-0, and set both the worker and control-plane node counts to 1.
  3. Change the NutanixMachineTemplate image name (.spec.template.spec.image.name) to a random string so that the image does not exist in Prism Central (PC).
  4. Apply the cluster YAML to the kind management cluster. After a few minutes, note the status of capx-cluster: it reports the cluster as Provisioned. This is incorrect; it should be either Provisioning or Failed.
  5. Delete the cluster object with kubectl delete cl capx-cluster. The command hangs because finalizers are set on the capx-cluster object. In another terminal, using the same kind management cluster context, check the logs of the CAPI controller manager and the CAPX controller manager.
  6. The CAPI controller manager reports:
I0511 23:54:36.485905       1 machine_controller.go:318] "Deleting Kubernetes Node associated with Machine is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/capx-cluster-kcp-kj4cp" namespace="default" name="capx-cluster-kcp-kj4cp" reconcileID=861994a7-ea7e-421d-8ed5-cbb0b5f95fb1 KubeadmControlPlane="default/capx-cluster-kcp" Cluster="default/capx-cluster" Node="" cause="cluster is being deleted"
E0511 23:54:36.510267       1 controller.go:329] "Reconciler error" err="machines.cluster.x-k8s.io \"capx-cluster-kcp-kj4cp\" not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/capx-cluster-kcp-kj4cp" namespace="default" name="capx-cluster-kcp-kj4cp" reconcileID=861994a7-ea7e-421d-8ed5-cbb0b5f95fb1
I0511 23:54:36.548737       1 cluster_controller.go:329] "Cluster still has descendants - need to requeue" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="default/capx-cluster" namespace="default" name="capx-cluster" reconcileID=5bf051b7-537e-422c-9750-a4d148a73867 infrastructureRef="capx-cluster"
  7. The CAPX controller manager reports:
I0512 00:00:06.558316       1 nutanixcluster_controller.go:122] NutanixCluster[namespace: default, name: capx-cluster] Reconciling the NutanixCluster.
I0512 00:00:06.558407       1 nutanixcluster_controller.go:157] NutanixCluster[namespace: default, name: capx-cluster] Fetched the owner Cluster: capx-cluster
I0512 00:00:06.558616       1 nutanixcluster_controller.go:333] Credential ref is kind Secret for cluster capx-cluster
E0512 00:00:06.558636       1 nutanixcluster_controller.go:342] error occurred while fetching cluster capx-cluster secret for credential ref: Secret "capx-cluster" not found
E0512 00:00:06.558650       1 nutanixcluster_controller.go:178] NutanixCluster[namespace: default, name: capx-cluster] error occurred while reconciling credential ref for cluster capx-cluster: error occurred while fetching cluster capx-cluster secret for credential ref: Secret "capx-cluster" not found
I0512 00:00:06.559019       1 nutanixcluster_controller.go:172] NutanixCluster[namespace: default, name: capx-cluster] Patched NutanixCluster. Status: {Ready:true FailureDomains:map[] Conditions:[{Type:ClusterCategoryCreated Status:False Severity:Info LastTransitionTime:2023-05-11 23:54:38 +0000 UTC Reason:Deleting Message:} {Type:CredentialRefSecretOwnerSet Status:False Severity:Error LastTransitionTime:2023-05-11 23:54:38 +0000 UTC Reason:CredentialRefSecretOwnerSetFailed Message:error occurred while fetching cluster capx-cluster secret for credential ref: Secret "capx-cluster" not found} {Type:PrismClientInit Status:True Severity: LastTransitionTime:2023-05-11 23:47:53 +0000 UTC Reason: Message:}] FailureReason:<nil> FailureMessage:<nil>}
1.6838496065590758e+09  ERROR   Reconciler error        {"controller": "nutanixcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "NutanixCluster", "NutanixCluster": {"name":"capx-cluster","namespace":"default"}, "namespace": "default", "name": "capx-cluster", "reconcileID": "93f1b33d-9269-4d15-b94a-64f2e08bcc72", "error": "error occurred while fetching cluster capx-cluster secret for credential ref: Secret \"capx-cluster\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        sigs.k8s.io/controller-runtime@…/pkg/internal/controller/controller.go:326
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        sigs.k8s.io/controller-runtime@…/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        sigs.k8s.io/controller-runtime@…/pkg/internal/controller/controller.go:234

Essentially, the CAPX delete path is looking for the Secret object, but the object no longer exists because it was deleted earlier.

What did you expect to happen:

  1. The cluster should not transition to Provisioned status; it should be in the Provisioning or Failed state.
  2. The cluster secret must not be deleted first. If it has already been deleted, the reconciler's delete logic must skip looking it up rather than failing.

Anything else you would like to add:

None

Environment:

  • Cluster-api-provider-nutanix version: v1.2.1
  • Kubernetes version (from kubectl version): v1.25.3
  • OS (e.g. from /etc/os-release): "CentOS Linux 7 (Core)"
@nutanix-cn-prow-bot

@prajnutanix: The label(s) kind/bug cannot be applied, because the repository doesn't have them.

In response to this:

/kind bug


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
