VolumeDelete may fail if Snapshots for volume not deleted first #252

Per container-storage-interface/spec#347, CSI permits a storage system to refuse to delete a volume that has outstanding snapshots, if the storage system does not treat snapshots as independent entities (this is not recommended, but the spec permits it).

We should figure out how to handle this. Is there some programmatic recovery we could implement? If that is not possible, let's make sure the correct errors are surfaced to users so they can handle it themselves, and document the behavior somewhere.
We implemented the SPEC behavior by returning an error from DeleteVolume when the volume still has snapshots. It seems the current spec definition will leave dangling volumes and snapshots, and lets the state of Kubernetes and the storage system get out of sync. We are using k8s 1.17 + external-provisioner 1.5.
The SPEC seems to be at odds with the implementation.
I wonder if we should add a new SNAPSHOTS_DEPEND_ON_VOLUME controller capability. If it is true, the snapshot controller can add a finalizer on the source PVC when taking a snapshot and remove it after the last snapshot is deleted. There is already a finalizer that protects the source PVC today, but it is only added while the snapshot is being created and removed once the snapshot is ready to use. Maybe we can extend the life of that finalizer when SNAPSHOTS_DEPEND_ON_VOLUME is true.
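A minimal sketch of what that could look like, assuming the proposed capability existed — the finalizer name below is purely illustrative, not the actual protection finalizer the snapshot controller uses today:

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
)

// Hypothetical finalizer name for the proposed SNAPSHOTS_DEPEND_ON_VOLUME
// behavior; illustrative only.
const snapshotsDependOnVolumeFinalizer = "snapshot.storage.kubernetes.io/pvc-has-snapshots"

// pinSourcePVC adds the finalizer if it is not already present. The
// controller would persist the change with an Update() call via client-go,
// and a matching removal would run after the last snapshot of this PVC is
// deleted.
func pinSourcePVC(pvc *corev1.PersistentVolumeClaim) {
	for _, f := range pvc.Finalizers {
		if f == snapshotsDependOnVolumeFinalizer {
			return // already pinned
		}
	}
	pvc.Finalizers = append(pvc.Finalizers, snapshotsDependOnVolumeFinalizer)
}
```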
Correct me if I have misunderstood. The way to recover (roll back) a PVC to one of its snapshots would be to delete the PVC and recreate it with a DataSource pointing to the snapshot to recover from (see the sketch below). If we add the said finalizers, the PVC will not be deleted until all snapshots against it are deleted. That leaves either no snapshot to revert to in the above workflow, or no ability to re-create the PVC, since it is not deleted until the snapshots are.

Also, in my testing the PVC is deleted, but the PV remains. It is the PV deletion that is retried (with exponential backoff) until it succeeds. This currently frees up re-creation of the PVC with the same name@namespace but pointing to a snapshot as the DataSource.

If we are aiming for graceful deletion attempts once preconditions are met, then the said finalizers could be placed on the PV instead, and PV deletion not attempted until they are removed. That would cut out the unnecessary PV delete calls and is possibly what is needed to address this issue.
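For reference, a sketch of the re-create-from-snapshot step using client-go types of that era; all names, the namespace, the StorageClass, and the size are placeholders:

```go
package restore

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// restoredPVC builds a PVC whose DataSource points at an existing
// VolumeSnapshot, which is how a volume is recreated from a snapshot.
func restoredPVC() *corev1.PersistentVolumeClaim {
	apiGroup := "snapshot.storage.k8s.io"
	storageClass := "csi-sc" // placeholder StorageClass
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "restored-pvc", Namespace: "default"},
		Spec: corev1.PersistentVolumeClaimSpec{
			StorageClassName: &storageClass,
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("1Gi"),
				},
			},
			DataSource: &corev1.TypedLocalObjectReference{
				APIGroup: &apiGroup,
				Kind:     "VolumeSnapshot",
				Name:     "my-snap", // the snapshot to restore from
			},
		},
	}
}
```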
@ShyamsundarR the way most enterprise-class external storage arrays work with snapshots gives them the ability to restore the snapshot directly over the existing volume, without having to delete the volume first.
Fair enough, so a rollback of the volume to a snapshot. As I work with Ceph, which has similar rollback options for block device (rbd) based volumes/images, I understand the concept. But how can this rollback be instrumented via CSI and Kubernetes? Snapshot creation is instrumented via Kubernetes and CSI, so the rollback, IMHO, should be instrumented via similar means in the orchestration plane. That would also help ensure the volume being rolled back is not in active use by any workload, so a recovery (or rollback) can be initiated safely. There are currently no schemes for this afaict, unless it is done via the storage provider's external interfaces. Is the future direction in this regard changing?

Also, there will be storage providers that do not support such a rollback; for those, the scheme should cover the ability to delete and recreate the volume from a snapshot, IMO. So we may need to ensure the solution caters to both use cases.
@xing-yang I think this is a good idea! It gives storage vendors the flexibility to operate in the mode they want. As per my understanding, snapshot/clone/promotion of a snapshot, etc. depend on the storage backend in place, and each implements them in its own way: some keep references, some untangle the snapshot completely from the source volume, and so on. Keeping a way for the driver to advertise its snapshots' relation to the parent volume opens this up, and IMO that is the best take compared to going ahead with some restricted path.