Backup remains stuck in phase "Deleting", no additional delete requests honoured #345
Comments
logs.zip |
@rahulr3 Please collect logs from Backup Driver. See instructions here on how to collect them: |
CC @deepakkinni |
logs.zip |
/assign @deepakkinni |
@xing-yang @deepakkinni I don't know if you have access to move the issue, but if you find that this is a vSphere plugin issue, we can transfer it to that repo if you prefer. |
I am taking a look at the logs, unsure of the root cause. If it ends up being a plugin issue you could transfer it. |
@nrb mentioned that operations on Velero are serialized. Could you repro the issue as follows: Some additional info: if Velero sees a delete request for a Backup that is already being deleted, the duplicate delete request is ignored. |
@deepakkinni
I restarted velero only once to observe if velero starts processing delete
requests. After restarting, it could delete one backup and the next backup
remained in phase "Deleting". The logs provided earlier shouldn't have got
rolled over.
Update: I submitted the delete request, but Velero doesn't honor it. The Backup phase remains Completed and no additional deleteSnapshot CR is created, which confirms this understanding.
> Some additional info, if velero sees delete Backup on Backup that is
already being deleted, then, the duplicate delete backup request is ignored.
Isn't this a problem?
I am trying to reproduce the issue and upload the logs. How long is enough
to conclude that the delete is hung? 1 hour?
Update: Even after 8 hours, the CR status is not updated and the Backup has not moved to the Deleting phase. But I do see a deleteSnapshot CR getting created.
Additional observations: I think deleteSnapshot CRs, once successfully processed, are removed after 24 hours? But I see some CRs still present that are several days old. All of them are in phase "New".
[root@dummy-host ~]# kubectl get deletesnapshot -n velero
NAME AGE
delete-8f3236a8-a114-49bc-a722-99b7b9bcc231 4d18h
delete-e914441d-c1d5-4d0b-b4a2-e5d8a13474c7 2d17h
delete-f069454c-adf0-45d1-b235-be09e9185948 26h
delete-f2a92d8f-66d7-4922-86dd-638788a37a9a 2d15h
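As a rough aid for spotting such leftover CRs, the listing above can be filtered by age. This is only a sketch: the helper name `age_to_hours` is hypothetical, the 24-hour threshold is the guess stated in this comment (not a confirmed retention period), and only the `d`, `h`, `m` and `s` units of the kubectl AGE column are handled.

```python
import re

# Hours per unit for the AGE strings handled by this sketch.
UNIT_HOURS = {"d": 24.0, "h": 1.0, "m": 1.0 / 60, "s": 1.0 / 3600}

def age_to_hours(age):
    """Convert a kubectl AGE string such as '4d18h' or '26h' to hours."""
    parts = re.findall(r"(\d+)([dhms])", age)
    # Reject anything that is not made up purely of <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != age:
        raise ValueError(f"unsupported AGE format: {age!r}")
    return sum(int(n) * UNIT_HOURS[u] for n, u in parts)

# Output of `kubectl get deletesnapshot -n velero`, copied from this comment.
listing = """\
NAME AGE
delete-8f3236a8-a114-49bc-a722-99b7b9bcc231 4d18h
delete-e914441d-c1d5-4d0b-b4a2-e5d8a13474c7 2d17h
delete-f069454c-adf0-45d1-b235-be09e9185948 26h
delete-f2a92d8f-66d7-4922-86dd-638788a37a9a 2d15h
"""

THRESHOLD_HOURS = 24  # assumption: the cleanup window guessed in this thread

stale = []
for line in listing.splitlines()[1:]:  # skip the header row
    name, age = line.split()
    if age_to_hours(age) > THRESHOLD_HOURS:
        stale.append(name)

for name in stale:
    print(name)
```

With the listing above, all four CRs exceed the assumed 24-hour window.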
The logs include details of the backup that is being deleted. @deepakkinni
Latest Logs:
[logs_0422.zip](https://github.com/vmware-tanzu/velero/files/6359069/logs_0422.zip)
Thanks,
Rahul
|
@rahulr3 Issue-1: Unprocessed delete snapshot CRs
Here the incoming snapshotID of the deleteSnapshot CR is empty. This could only occur if the Snapshot CRD stored as an annotation on the PVC was incorrect. Please upload the Velero logs so we can determine what specifically happened to these CRs.
Issue-2: Random DeleteSnapshot issues:
Here there are two observations. First, the FCD delete snapshot is failing because it cannot find the snapshot. This is particularly odd; we would most likely see this if the snapshot was already deleted. We need vCenter and ESX logs to root-cause it. Second, the backup driver is unable to find the specific ivd+snap on the remote repository. Again, this is likely because the remote snapshot (AWS S3) was either previously deleted or never created in the first place. It wouldn't be created if DeleteBackup was invoked before the upload completed. Could you verify that you waited until all the Uploads had completed successfully before invoking the backup deletes? So basically the required logs are:
Please repro this issue to collect all the necessary logs in one go.
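One way to check that uploads have finished before issuing a delete is to scan the Upload CRs for anything not yet done. The sketch below is illustrative only: it assumes the CR list has been exported as JSON (for example with `kubectl get ... -o json`), that each item carries a `status.phase` field, and that the finished phase is named "Completed". The function name `pending_uploads` and the sample data are hypothetical.

```python
import json

def pending_uploads(upload_list_json, done_phase="Completed"):
    """Return the names of Upload CRs whose status.phase is not done_phase.

    Items with no status at all are treated as still pending.
    """
    items = json.loads(upload_list_json).get("items", [])
    return [
        item["metadata"]["name"]
        for item in items
        if item.get("status", {}).get("phase") != done_phase
    ]

# Hypothetical sample data standing in for an exported CR list.
sample = json.dumps({"items": [
    {"metadata": {"name": "upload-a"}, "status": {"phase": "Completed"}},
    {"metadata": {"name": "upload-b"}, "status": {"phase": "InProgress"}},
    {"metadata": {"name": "upload-c"}},
]})

not_done = pending_uploads(sample)
if not_done:
    print("wait before deleting; uploads still pending:", not_done)
```

If the returned list is non-empty, deleting the backup at that point may hit the missing-remote-snapshot case described above.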
@zubron |
Uploading delete_issue_0428.zip… Please find the attached logs as requested. All logs are provided except the ESX and vCenter logs, as they can't be accommodated within the 10 MB limit imposed by GitHub. |
@rahulr3 Anyways, based on the conversation in the email, I think I could point out a reason as to why the Backup delete may hang:
To confirm whether this is what's happening, I'd need to look at the Velero and backup driver logs for the velero Backup command to figure out what actually caused the malformed Snapshot CRD to be stored as a PVC annotation. The other |
@deepakkinni uploading the logs again |
Thanks @rahulr3
m2bkp1 velero logs:
time="2021-04-30T00:42:32Z" level=info msg="Listing items" backup=velero/m2bkp1 group=v1 logSource="pkg/backup/item_collector.go:291" namespace=j2pv resource=pods
time="2021-04-30T00:43:03Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:642"
The backup starts at "2021-04-30T00:42:32Z" and ends at "2021-04-30T00:43:03Z"; in this time frame there are 2 snapshots created
backup-driver logs:
|
All of Velero controllers reconcile only one object at a time. |
I believe there is another backup delete in progress. From my understanding, one of the backup deletes calls the plugin, but the plugin does not return (due to the bug described in my previous comment). Do you think the plugin not returning could cause Velero to hang on a single request? Is there a timeout for the plugin to process the delete? |
In the interest of better UX, Velero plugins should implement timeouts and cancellable operations in the plugin interface. In order to make backwards compatible changes to the plugin interface the issue of plugin versioning needs to be addressed. For velero V1.7 we are working on making a proposal for plugin versioning. |
The plugin not returning can definitely cause this. There is no timeout or cancellation for these requests. That's definitely something worth adding.
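To illustrate why a missing timeout stalls everything behind it, here is a minimal sketch of a serialized worker that bounds each plugin call. It is not Velero's implementation: `call_with_timeout`, `process_serially`, `fake_plugin_delete` and the phase strings are hypothetical names, and the "stuck" request simply sleeps to simulate a plugin call that does not return in time.

```python
import threading
import time

def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args) in a daemon thread; raise TimeoutError if it overruns."""
    result = {}

    def runner():
        try:
            result["value"] = fn(*args)
        except Exception as exc:  # surface plugin errors to the caller
            result["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        # The call is abandoned, not cancelled: the thread keeps running.
        raise TimeoutError(f"call did not return within {timeout_s}s")
    if "error" in result:
        raise result["error"]
    return result["value"]

def process_serially(requests, delete_fn, timeout_s):
    """Handle delete requests one at a time, like a single-worker controller."""
    outcomes = {}
    for name in requests:
        try:
            call_with_timeout(delete_fn, timeout_s, name)
            outcomes[name] = "Deleted"
        except TimeoutError:
            outcomes[name] = "TimedOut"
    return outcomes

def fake_plugin_delete(name):
    # Hypothetical stand-in for the plugin; "stuck" simulates a slow call.
    if name == "stuck":
        time.sleep(1.0)
    return name

outcomes = process_serially(["bkp1", "stuck", "bkp2"], fake_plugin_delete, 0.1)
print(outcomes)
```

Without the bound, the loop would wait on "stuck" and never reach "bkp2", which matches the behaviour reported in this issue. Note that the timeout only lets the queue move on; stopping the underlying work would still need cooperation from the plugin, which is what cancellable operations would provide.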
|
This issue should be mitigated by #351 |
This is fixed by #351, closing the bug. |
Velero Version:
Velero 1.5.4
Velero Plugin for AWS 1.1.0
Velero VSphere Plugin 1.1.0
Configured S3 storage using minio on a separate node.
Create a namespace with 150 PVs and 150 Pods with total 2400+ resources.
Create multiple backups (more than 30)
Observations