CAPZ stays stuck in deleting mode when resources are all actually gone #4570
Comments
I'd be surprised if this were a general problem since I don't remember seeing this ever in e2e. Can you provide any more details about what the cluster template looked like or if you were doing anything to it around the time it was being deleted? It would also be helpful to know what the capz-controller-manager was logging for the CAPZ resources while they were stuck and what the full YAML of the resources was then. |
I think the errors were like this when it was in this state. These are the logs with the state of an AKS cluster not being able to be created. ASO logs
CAPZ logs
|
Unfortunately, I can't reproduce the problem right now - but will circle back when/if I do. |
It almost seems like there's something wonky in your Workload ID setup or your subscription or something. I've never seen this kind of error in e2e or locally. |
I have a live repro of this now. I'm pretty sure you can reproduce this by doing the following:
Here are the logs I have right now from ASO (also using the 1.14.0 release).
|
#4609 may or may not be connected to this issue, but since both issues are about deletion, I am mentioning it here as well. |
Trying to fix this now and followed the steps to reproduce. I'm getting a slightly different error:
I think that might be because I didn't wait for the Azure resource to fully delete before restarting the management cluster. This should be caused by the same issue though. I think the last time when we reproduced this together, we also started a delete locally before doing so on Azure. |
Yes, we did start the delete before powering the cluster off. So between steps 1 and 2, add "delete the CAPZ cluster object", and don't let it finish deleting. Either case should be handled in some way though, IMO. If you don't initiate a delete of the CAPZ cluster object and instead manually delete the cluster while the management cluster is off, CAPZ should re-create the cluster from its definition when it powers back on... |
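The two cases described in the comment above can be summarized as a small decision function. This is only an illustrative sketch of the expected behavior, not CAPZ's actual reconciler; the names `next_action`, `deletion_requested`, and `azure_resource_exists`, as well as the returned action strings, are hypothetical.

```python
def next_action(deletion_requested: bool, azure_resource_exists: bool) -> str:
    """Return the action a reconciler would be expected to take (hypothetical sketch)."""
    if deletion_requested:
        # A delete was initiated on the CAPZ cluster object.
        if azure_resource_exists:
            return "delete-azure-resource"
        # The Azure resource is already gone, so there is nothing left to
        # delete and the object should be allowed to go away.
        return "remove-finalizer"
    # No delete was requested, so the definition is still the desired state.
    if azure_resource_exists:
        return "no-op"
    # The resource was deleted out of band: bring it back to match the definition.
    return "recreate-azure-resource"
```

Under these assumptions, a cluster that was manually deleted in Azure after a delete was requested maps to `remove-finalizer`, while the same manual deletion without a requested delete maps to `recreate-azure-resource`.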
Just confirmed that the cluster does delete after powering the management cluster off and on, even when replicas is set to 0. The cluster deletion just takes a while (~25 minutes). This can be tracked in #4339.
/close |
@willie-yao: Closing this issue. |
/kind bug
What steps did you take and what happened:
Ran the command to delete an existing deployed cluster and it stayed permanently stuck in the deleting state. I then manually deleted the cluster resources created by CAPZ and CAPZ still stayed stuck in the deleting state.
What did you expect to happen:
CAPZ would detect that the resources it was trying to delete were already removed and get rid of the remaining definition, which was still trying to delete what was clearly already gone.
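As a rough illustration of the expected behavior, a delete step can treat "not found" as success so that cleanup completes when the resource is already gone. This is a minimal sketch under stated assumptions, not CAPZ or ASO code: `NotFoundError`, `reconcile_delete`, `delete_fn`, and the finalizer name are all hypothetical.

```python
class NotFoundError(Exception):
    """Stand-in for a 404 response from the cloud API (hypothetical)."""


# Hypothetical finalizer name, used only for this illustration.
FINALIZER = "example.infrastructure/cluster"


def reconcile_delete(resource_id, delete_fn, finalizers):
    """Delete a resource idempotently and return the remaining finalizers.

    delete_fn is any callable that deletes the resource and raises
    NotFoundError when the resource no longer exists.
    """
    try:
        delete_fn(resource_id)
    except NotFoundError:
        # The resource is already gone: treat this the same as a
        # successful delete instead of retrying forever.
        pass
    # Any other exception propagates, so the finalizer stays in place
    # and the delete can be retried later.
    return [f for f in finalizers if f != FINALIZER]
```

With this shape, a manually removed resource no longer blocks deletion, while genuine failures (for example, an authentication error) still leave the finalizer in place.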
Environment:
management cluster
Kubernetes version (use kubectl version): 1.29.1
OS (e.g. from /etc/os-release): WSL Ubuntu 22.04, Docker Desktop K8s