Backup remains stuck in phase "Deleting", no additional delete requests honoured #345

Closed
rahulr3 opened this issue Apr 20, 2021 · 20 comments · Fixed by #351

@rahulr3

rahulr3 commented Apr 20, 2021

Velero Version:
Velero 1.5.4
Velero Plugin for AWS 1.1.0
Velero vSphere Plugin 1.1.0

Configured S3 storage using MinIO on a separate node.

Created a namespace with 150 PVs and 150 Pods, with a total of 2400+ resources.
Created multiple backups (more than 30).

Observations

  1. Deleted one of the backups; it remained in the "Deleting" phase forever. This happened without Velero being restarted.
  2. Submitted additional delete requests, but these were not picked up by Velero for processing. The corresponding CRs (deletebackuprequests) showed no status updates (the commands used to check this are sketched below).
  3. Is this because the Velero delete controller remains blocked on the earlier backup that is stuck in the "Deleting" phase?
  4. Restarted Velero, after which the subsequent delete requests were picked up. However, the first backup still remained in the "Deleting" phase.
  5. A subsequent delete backup request was served and the backup was deleted.
  6. However, the next backup got stuck in the "Deleting" phase again.
  7. Additional delete requests were not picked up by Velero.
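
For reference, a minimal sketch of the commands used to check the stuck state (assuming the default velero namespace; resource and column names may differ across versions):

# Backup phases ("Deleting" indicates the stuck backup)
kubectl -n velero get backups.velero.io -o custom-columns=NAME:.metadata.name,PHASE:.status.phase
# Delete requests and whether their status was ever updated
kubectl -n velero get deletebackuprequests.velero.io -o yaml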
@rahulr3
Author

rahulr3 commented Apr 20, 2021

logs.zip
Uploading the Velero deployment logs and the YAML for both backups that remain in the "Deleting" phase.

@xing-yang
Contributor

@rahulr3 Please collect logs from Backup Driver. See instructions here on how to collect them:
https://github.com/vmware-tanzu/velero-plugin-for-vsphere/blob/main/docs/troubleshooting.md#generalvanilla

@xing-yang
Contributor

CC @deepakkinni

@rahulr3
Author

rahulr3 commented Apr 21, 2021

logs.zip
Added Backup driver logs as requested @xing-yang

@xing-yang
Contributor

/assign @deepakkinni

@zubron

zubron commented Apr 21, 2021

@xing-yang @deepakkinni I don't know if you have access to move the issue, but if you find that this is a vSphere plugin issue, we can transfer it to that repo if you prefer.

@deepakkinni
Collaborator

I am taking a look at the logs; I'm not yet sure of the root cause. If it ends up being a plugin issue, you could transfer it.

@deepakkinni
Collaborator

@rahulr3

@nrb mentioned that operations in Velero are serialized.
I took a look at the Backup hostname-1-ns1-1618704337 deletion:
dkinni@dkinni-a02 logs-1 % grep "completed with err*" velero_1.log|grep hostname-1-ns1-1618704337|wc -l
47
I don't see any errors reported during DeleteSnapshot processing on the plugin.

Could you reproduce the issue as follows:
Delete a backup; if the backup remains in the "Deleting" phase, collect the backup driver logs and Velero logs (do not restart Velero at this point, as the logs get rolled over). Otherwise, repeat the process until you notice a hung delete backup request.
It would be great if you could start from a clean setup if possible.
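
If it helps, a minimal log-collection sketch (assuming the default deployment names velero and backup-driver in the velero namespace; adjust to your install):

# Velero server logs
kubectl -n velero logs deploy/velero > velero.log
# Backup driver logs (deployed by the vSphere plugin)
kubectl -n velero logs deploy/backup-driver > backup-driver.log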

Some additional info: if Velero sees a delete request for a backup that is already being deleted, the duplicate delete backup request is ignored.

@rahulr3
Author

rahulr3 commented Apr 22, 2021 via email

@deepakkinni
Collaborator

deepakkinni commented Apr 26, 2021

@rahulr3
Sorry for the delayed reply, I was on PTO.
I took a look at the new logs; the Velero logs are missing. I did try to look at the available info:

Issue-1: Unprocessed delete snapshot CRs
All the delete snapshot CRs mentioned in the comments suffer from the same error:

delete-e914441d-c1d5-4d0b-b4a2-e5d8a13474c7   2d17h
time="2021-04-22T01:10:01Z" level=info msg="syncDeleteSnapshotByKey: calling deleteSnapshot velero/delete-e914441d-c1d5-4d0b-b4a2-e5d8a13474c7 " controller=BackupDriverController logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller_base.go:820"
time="2021-04-22T01:10:01Z" level=info msg="deleteSnapshot called with SnapshotID , Namespace: velero, Name: delete-e914441d-c1d5-4d0b-b4a2-e5d8a13474c7" controller=BackupDriverController logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller.go:132"
time="2021-04-22T01:10:01Z" level=info msg="Calling Snapshot Manager to delete snapshot with snapshotID " controller=BackupDriverController logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller.go:136"
time="2021-04-22T01:10:01Z" level=error msg="Fail to construct new Protected Entity ID from string " controller=BackupDriverController error="astrolabe: '' is not a valid protected entity ID" logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller.go:139"

Here the incoming snapshotID of the DeleteSnapshot CR is empty; this could only occur if the Snapshot CR stored as an annotation on the PVC was incorrect. Please upload the Velero logs so we can determine what specifically happened to these CRs.

Issue-2: Random DeleteSnapshot issues:

time="2021-04-22T15:08:44Z" level=info msg="Retrieved VSLM task VslmTask-26690 to track DeleteSnapshot on IVD ivd:71bf0aff-03e7-49ae-becc-70e88709e134:4388cf40-cd6f-48cd-8ad9-5ce133c8995f" logSource="/go/pkg/mod/github.com/vmware-tanzu/[email protected]/pkg/ivd/ivd_protected_entity.go:396"
time="2021-04-22T15:08:45Z" level=info msg="Waited for 1.265985002s to retrieve Task VslmTask-26690 status." logSource="/go/pkg/mod/github.com/vmware-tanzu/[email protected]/pkg/ivd/ivd_protected_entity.go:400"
time="2021-04-22T15:08:45Z" level=info msg="Disk doesn't have given snapshot due to the snapshot stamp was removed in the previous DeleteSnapshot operation which failed with InvalidState fault. And it will be resolved by the next snapshot operation on the same VM. Will NOT retry" error=InvalidArgument logSource="/go/pkg/mod/github.com/vmware-tanzu/[email protected]/pkg/ivd/ivd_protected_entity.go:405"
time="2021-04-22T15:08:45Z" level=info msg="Step 2: Deleting the durable snapshot from s3" logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr/snapshot_manager.go:508" peID="pvc:ns1/ns1pod100pvc100:aXZkOjcxYmYwYWZmLTAzZTctNDlhZS1iZWNjLTcwZTg4NzA5ZTEzNDo0Mzg4Y2Y0MC1jZDZmLTQ4Y2QtOGFkOS01Y2UxMzNjODk5NWY"
time="2021-04-22T15:08:45Z" level=error msg="Failed to get the ProtectedEntity from peID ivd:71bf0aff-03e7-49ae-becc-70e88709e134:4388cf40-cd6f-48cd-8ad9-5ce133c8995f" error="GetObject failed for bucket velero, key plugins/vsphere-astrolabe-repo/ivd/peinfo/ivd:71bf0aff-03e7-49ae-becc-70e88709e134:4388cf40-cd6f-48cd-8ad9-5ce133c8995f: NoSuchKey: The specified key does not exist.\n\tstatus code: 404, request id: 1678369EF55238CC, host id: " error.file="/go/pkg/mod/github.com/vmware-tanzu/[email protected]/pkg/s3repository/repository_protected_entity_type_manager.go:176" error.function="github.com/vmware-tanzu/astrolabe/pkg/s3repository.(*ProtectedEntityTypeManager).GetProtectedEntity" logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr/snapshot_manager.go:574"

There are two observations here. First, the FCD delete snapshot is failing because it cannot find the snapshot. This is particularly odd; we would most likely see this if the snapshot had already been deleted. We need vCenter and ESX logs to root cause it.

The second observation is that the backup driver is unable to find the specific IVD+snapshot in the remote repository. Again, this is likely because the remote snapshot (AWS S3) was either previously deleted or never created in the first place. It would not have been created if the delete backup was invoked before the upload completed.

Could you verify that you waited until all the uploads successfully completed before invoking the delete backups?
You can verify this by looking at the Snapshot CRs created as a result of the backup operation; the expected state of all the Snapshot CRs is "Uploaded". Alternatively, check the Upload CRs' status; again, the expectation is that the state is "Uploaded". You must rely on these, rather than on the Backup CR, to verify that the backup is complete.
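
As a rough sketch (assuming the short resource names snapshots and uploads resolve to the plugin's CRDs, that the Snapshot CRs live in the backed-up namespace, and that the Upload CRs live in the velero namespace):

# All Snapshot CRs should report phase "Uploaded"
kubectl -n <backup-namespace> get snapshots -o custom-columns=NAME:.metadata.name,PHASE:.status.phase
# All Upload CRs should report phase "Uploaded"
kubectl -n velero get uploads -o custom-columns=NAME:.metadata.name,PHASE:.status.phase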

So basically the required logs are:

  1. vCenter + ESX logs
  2. Velero logs
  3. Backup driver logs
  4. YAML output of the DeleteSnapshot CRs, Backup CRs, and delete backup request CRs.

Please reproduce this issue and collect all the necessary logs in one go.

  1. Clean up the setup and make sure there are no previously created DeleteSnapshot or Snapshot CRs; a fresh setup is better for faster triage.
  2. Invoke a backup.
  3. Wait for all the Snapshot CRs to be in the "Uploaded" phase.
  4. Alternatively, wait for the Upload CRs to move into the "Uploaded" phase; verifying both the Snapshot and Upload CRs should confirm that the backup is indeed uploaded.
  5. Trigger the backup delete; this should create a set of DeleteSnapshot CRs. Check whether they move into the "Completed" state; if they do not after waiting, collect the YAML output for the CRs, e.g.:
kubectl get deletesnapshot delete-8f3236a8-a114-49bc-a722-99b7b9bcc231 -n velero -o yaml
  6. Collect the Velero, backup driver, and vCenter + ESX logs. It's essential that you collect them immediately after you observe the issue to avoid rollover; the time you wait should be the time it normally takes for a successful delete, and there is no need to wait several hours before collecting the logs (a collection sketch for the CR YAML follows these steps).
    It would be great if the Velero logs collected contain both the backup and the delete backup request.
    It would also help if you stuck to a simple test, i.e., a backup followed by a delete backup; please do not request duplicate delete backups.
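
A rough collection sketch for the CR YAML (namespaces and short resource names are assumptions; adjust to your setup):

kubectl -n velero get deletesnapshots -o yaml > deletesnapshots.yaml
kubectl -n velero get backups.velero.io -o yaml > backups.yaml
kubectl -n velero get deletebackuprequests.velero.io -o yaml > deletebackuprequests.yaml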

@zubron
Could you transfer the issue to plugin? I am not sure how to do it.

@zubron zubron transferred this issue from vmware-tanzu/velero Apr 26, 2021
@deepakkinni deepakkinni assigned deepakkinni and unassigned xing-yang Apr 26, 2021
@rahulr3
Author

rahulr3 commented Apr 28, 2021

Uploading delete_issue_0428.zip…

Please find the attached logs as requested. All logs are provided except the ESX + vCenter logs, as they can't be accommodated within the 10 MB limit imposed by GitHub.

@deepakkinni
Collaborator

@rahulr3
The attached logs are not accessible; the link redirects back to this issue. Please upload them again.

Anyway, based on the conversation over email, I think I can point out a reason why the backup delete may hang:

  1. During the backup, a malformed Snapshot CR is stored as a PVC annotation.
  2. During the backup delete, we pick up this Snapshot CR from the PVC and use it to extract the snapshot ID.
  3. A DeleteSnapshot CR is created with the extracted snapshot ID; at this point Velero waits for the DeleteSnapshot CR to either complete successfully or fail.
  4. Bug-1: The backup driver fails while resolving the snapshot ID, but the failure doesn't put the DeleteSnapshot CR into a "Failed" state, which means the DeleteSnapshot CR stays in the "New" phase.
  5. The backup driver continuously retries processing the DeleteSnapshot CR since it sees it is in the "New" state, while Velero is still waiting for the DeleteSnapshot CR to either succeed or fail. This causes the backup delete to hang.

To confirm that this is what's happening, I'd need to look at the Velero and backup driver logs for the velero backup command to figure out what actually caused the malformed Snapshot CR to be stored as a PVC annotation.
Please upload them.
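
If it helps, a minimal sketch for inspecting the state described above (the placeholders and short resource names are assumptions):

# Dump the PVC annotations to see the Snapshot CR stored there
kubectl -n <pvc-namespace> get pvc <pvc-name> -o jsonpath='{.metadata.annotations}'
# Check whether any DeleteSnapshot CRs are stuck in the "New" phase
kubectl -n velero get deletesnapshots -o custom-columns=NAME:.metadata.name,PHASE:.status.phase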

The other bug (Bug-2) noticed while triaging this issue is that there is some error shadowing happening in the snapshot manager, but it is unrelated to the issues you are seeing; I am mentioning it here for completeness.

@rahulr3
Author

rahulr3 commented Apr 29, 2021

@deepakkinni uploading the logs again
delete_issue_0428.zip

@deepakkinni
Collaborator

Thanks @rahulr3

Backup: m2bkp1

velero logs:

time="2021-04-30T00:42:32Z" level=info msg="Listing items" backup=velero/m2bkp1 group=v1 logSource="pkg/backup/item_collector.go:291" namespace=j2pv resource=pods
time="2021-04-30T00:42:32Z" level=info msg="Retrieved 2 items" backup=velero/m2bkp1 group=v1 logSource="pkg/backup/item_collector.go:297" namespace=j2pv resource=pods
time="2021-04-30T00:42:32Z" level=info msg="Getting items for resource" backup=velero/m2bkp1 group=v1 logSource="pkg/backup/item_collector.go:165" resource=persistentvolumeclaims
time="2021-04-30T00:42:32Z" level=info msg="Listing items" backup=velero/m2bkp1 group=v1 logSource="pkg/backup/item_collector.go:291" namespace=j2pv resource=persistentvolumeclaims
time="2021-04-30T00:42:32Z" level=info msg="Retrieved 2 items" backup=velero/m2bkp1 group=v1 logSource="pkg/backup/item_collector.go:297" namespace=j2pv resource=persistentvolumeclaims
.
.
time="2021-04-30T00:42:36Z" level=info msg="VSphere PVCBackupItemAction: StorageClass is not set for PVC j2pv/j2pvpod0pvc0" backup=velero/m2bkp1 cmd=/plugins/velero-plugin-for-vsphere logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/plugin/backup_pvc_action_plugin.go:67" pluginName=velero-plugin-for-vsphere
.
.
time="2021-04-30T00:42:56Z" level=info msg="snapshot= snap-e8539258-358f-49ff-9001-4b2d2ca85414 changed, phase = Snapshotted" backup=velero/m2bkp1 cmd=/plugins/velero-plugin-for-vsphere logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotUtils/snapshot.go:130" pluginName=velero-plugin-for-vsphere
time="2021-04-30T00:42:56Z" level=info msg="Persisting snapshot with snapshotID :pvc:j2pv/j2pvpod0pvc0:aXZkOmM4MmQwY2NmLTIwODItNGQ0NS04MmI5LTJjZWUwNWIzMTQ2ZTo1MjQ2ZWE3My1lNTA0LTRiMTgtOWRiNy1kMTNlNjcxZmE3NzI
.
.
time="2021-04-30T00:42:56Z" level=info msg="VSphere PVCBackupItemAction: StorageClass is not set for PVC j2pv/j2pvpod1pvc1" backup=velero/m2bkp1 cmd=/plugins/velero-plugin-for-vsphere logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/plugin/backup_pvc_action_plugin.go:67" pluginName=velero-plugin-for-vsphere
.
.
time="2021-04-30T00:43:02Z" level=info msg="snapshot= snap-d03cfdd4-3460-4c81-941c-8c04907ef24c changed, phase = Snapshotted" backup=velero/m2bkp1 cmd=/plugins/velero-plugin-for-vsphere logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotUtils/snapshot.go:130" pluginName=velero-plugin-for-vsphere
time="2021-04-30T00:43:02Z" level=info msg="Persisting snapshot with snapshotID :pvc:j2pv/j2pvpod1pvc1:aXZkOjgwNDRiZjNjLWU5ZGEtNDUyOC05OWFmLWM5ZTkzYTJmZWE0YTpjZmI0MDg5MC1mYTI5LTQ5MGItYWFkMy0xNWMzN2IwNWQzMWU

time="2021-04-30T00:43:03Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:642"

The backup starts at "2021-04-30T00:42:32Z" and ends at "2021-04-30T00:43:03Z"; in this time frame two snapshots are created:

snapshot cr name: snap-e8539258-358f-49ff-9001-4b2d2ca85414
pvc: j2pv/j2pvpod0pvc0
fcd: ivd:c82d0ccf-2082-4d45-82b9-2cee05b3146e:5246ea73-e504-4b18-9db7-d13e671fa772

snapshot cr name: snap-d03cfdd4-3460-4c81-941c-8c04907ef24c
pvc: j2pv/j2pvpod1pvc1
fcd: ivd:8044bf3c-e9da-4528-99af-c9e93a2fea4a:cfb40890-fa29-490b-aad3-15c37b05d31e

backup-driver logs:
time="2021-04-30T00:44:04Z" level=info msg="updateSnapshotStatusPhase: Snapshot j2pv/snap-e8539258-358f-49ff-9001-4b2d2ca85414 updated phase from Uploading to Uploaded" controller=BackupDriverController logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller.go:292"
.
.
time="2021-04-30T00:44:25Z" level=info msg="updateSnapshotStatusPhase: Snapshot j2pv/snap-d03cfdd4-3460-4c81-941c-8c04907ef24c updated phase from Uploading to Uploaded" controller=BackupDriverController logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller.go:292"

At this point, we are certain that both the snapshots are uploaded to the remote repository.

A backup delete for m2bkp1 was initiated at "2021-04-30T00:50:20Z", but I don't see the delete backup request starting to be processed by Velero. Could the Velero team take a look? @zubron @nrb
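
A quick check against the Velero log (a sketch; the log file name and exact log fields are assumptions) should show whether the delete request was ever picked up:

# Look for any deletion activity for this backup
grep "m2bkp1" velero.log | grep -i "delet"
# Look for delete backup request processing
grep -i "deletebackuprequest" velero.log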

@ashish-amarnath
Member

All Velero controllers reconcile only one object at a time.
So, when the backup delete request object is not being processed, is there a backup that is already being deleted?
Also, this may be unlikely, but if Velero is started in --restore-only mode, backup deletion requests are not processed. Is this flag passed to the Velero server?
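
A quick way to verify the server flags (a sketch, assuming the default velero deployment in the velero namespace):

# Inspect the Velero server arguments for --restore-only
kubectl -n velero get deploy velero -o jsonpath='{.spec.template.spec.containers[0].args}'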

@deepakkinni
Collaborator

I believe there is another backup delete in progress. From my understanding, one of the backup deletes calls the plugin, but the plugin does not return (due to the bug described in my previous comment). Do you think the plugin not returning could cause Velero to hang on a single request? Is there a timeout for the plugin to process the delete?

@ashish-amarnath
Member

In the interest of better UX, Velero plugins should implement timeouts and cancellable operations in the plugin interface. In order to make backwards-compatible changes to the plugin interface, the issue of plugin versioning needs to be addressed. For Velero v1.7 we are working on a proposal for plugin versioning.

@ashish-amarnath
Member

ashish-amarnath commented May 27, 2021 via email

@deepakkinni
Collaborator

This issue should be mitigated by #351

@xing-yang
Contributor

This is fixed by #351, closing the bug.
