Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Edge cases for already exists resources #5223

Closed
shubham-pampattiwar opened this issue Aug 16, 2022 · 6 comments · Fixed by #5239
Closed

Fix Edge cases for already exists resources #5223

shubham-pampattiwar opened this issue Aug 16, 2022 · 6 comments · Fixed by #5239
Assignees
Labels
Needs info Waiting for information

Comments

@shubham-pampattiwar
Copy link
Collaborator

What steps did you take and what happened:
[A clear and concise description of what the bug is, and what commands you ran.)
On OpenShift 4.11 (Kubernetes 1.24) we are seeing an error(not warning) related to the restoration of volumesnapshotclass that it already exists, not reproducilbe on earlier versions. Restore fails with the following logs, This is added as an error by the restore controller because the create call itself is incomplete due to admission webhooks validation, had the create call executed and Velero had received an IsalreadyExists this would have been a warning.

time="2022-07-14T14:34:44Z" level=error msg="error restoring example-snapclass: admission webhook \"volumesnapshotclasses.snapshot.storage.k8s.io\" denied the request: default snapshot class: example-snapclass already exits for driver: pd.csi.storage.gke.io" logSource="pkg/restore/restore.go:1357" restore=openshift-adp/mysql-42edcb0a-0381-11ed-9e9d-902e163f806c
time="2022-07-14T14:34:45Z" level=error msg="Cluster resource restore error: error restoring volumesnapshotclasses.snapshot.storage.k8s.io/example-snapclass: admission webhook \"volumesnapshotclasses.snapshot.storage.k8s.io\" denied the request: default snapshot class: example-snapclass already exits for driver: pd.csi.storage.gke.io" logSource="pkg/controller/restore_controller.go:500" restore=openshift-adp/mysql-42edcb0a-0381-11ed-9e9d-902e163f806c

What did you expect to happen:
Restore should complete without errors.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Velero version (use velero version):
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@shubham-pampattiwar shubham-pampattiwar self-assigned this Aug 16, 2022
@reasonerjt
Copy link
Contributor

reasonerjt commented Aug 22, 2022

@shubham-pampattiwar

by not reproducilbe on earlier versions do you mean not reproducible on earlier versions of openshift?
If this was caused by the addmisionwebhook, what do you think we can fix in velero? If we always check the existence of resources upon seeing ANY error, it will cause other problems.

Would we suggest the maintainer of this admissionwebhook return a more explicit 409?

@reasonerjt reasonerjt added the Needs info Waiting for information label Aug 22, 2022
@sseago
Copy link
Collaborator

sseago commented Aug 22, 2022

@reasonerjt Not reproducible in that the specific conditions that triggered the error for this particular resource type aren't there in earlier versions. But the problem is a more generic one. I know I've seen this manifest itself in the past when working on unrelated code which checked for IsAlreadyExistError as a means of determining whether something existed before.

What other problems are you referring to? We have tested a fix internally that should resolve this which I think @shubham-pampattiwar will be submitting as a PR soon. Basically if the Create fails with an error other than IsAlreadyExistsError, we do a Get call to see if it actually exists. If it does, treat it as IsAlreadyExistsError. If not, add the error to the restore like we currently do. What problems are you seeing with this approach?

@shubham-pampattiwar
Copy link
Collaborator Author

@sseago +1, @reasonerjt please take a look at this proposed PR as a fix for this issue #5239

@blackpiglet
Copy link
Contributor

Which adminssion webhook triggered this error?
If the webhook is maintained by ourself, it's better to return a more self-explained error message or error code.

@shubham-pampattiwar
Copy link
Collaborator Author

@blackpiglet This was triggered by volumesnapshotclass admission webhook I believe.

@sseago
Copy link
Collaborator

sseago commented Aug 23, 2022

@blackpiglet The issue is that the admission webhook validation happens before the normal create call, which means we don't get an IsAlreadyExists error. In this case, the resource already exists, so we don't want a more self-explained error message. We don't want the error message at all. We want to recognize that the resource is already in the cluster (just as we do when we get an IsAlreadyExists error), and then we can apply our existing resource policy and either patch it or warn (if the object is different than the obj from backup), or do nothing (if the object is the same as in the backup).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs info Waiting for information
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants