
Velero fails to restore statefulsets.apps #4782

Closed
son-la opened this issue Mar 28, 2022 · 14 comments · Fixed by #5247
Assignees: reasonerjt
Labels: Helm (Issues related to Helm charts), Needs info (Waiting for information), staled

son-la commented Mar 28, 2022

What steps did you take and what happened:

  • Create a backup including a statefulset
  • Verify backup is created successfully with velero describe --details: Statefulset object is in the resource list
  • Verify statefulset object exists in S3 folder
  • Create restore from backup.
  • All other resources are restored except the statefulset object. Even the pods are restored successfully, but they are not grouped under the statefulset. An error message is seen in velero restore describe:
    [screenshot: error message from velero restore describe]

What did you expect to happen:
Statefulset object is restored successfully
The following information will help us better understand what's going on:

The restore log seems normal. There's only one statefulset object, and the log says it restored successfully: https://gist.github.com/son-la/f02d546f9e0d68cfdc9f4bfef279f480

Anything else you would like to add:
The restore did succeed for other stateful sets but fails for this one. I tried to spot the difference between the working one and the broken one, but I can't find anything special. The error message here is too cryptic for me to know where to look next.

Environment:

  • Velero version (use velero version):
    Client:
    Version: v1.8.1
    Git commit: 18ee078
    Server:
    Version: v1.8.1

  • Velero features (use velero client config get features): No

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.11", GitCommit:"27522a29febbcc4badac257763044d0d90c11abd", GitTreeState:"clean", BuildDate:"2021-09-15T19:16:25Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes installer & version: Rancher RKE
  • Cloud provider or hardware configuration: VMWare
  • OS (e.g. from /etc/os-release): RHEL8.4

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
reasonerjt (Contributor) commented:

@son-la
There should be more error detail in Velero's log to help us understand where the nil pointer error is thrown.

Please reproduce the issue, use velero debug to generate the log bundle, and attach it to the issue.

@reasonerjt reasonerjt added the Needs info Waiting for information label Mar 31, 2022
son-la (Author) commented Mar 31, 2022

velero-bundle.zip
Thanks for the reply. I generated the velero bundle specifically for that failed restore.

I searched and replaced some sensitive information (S3 endpoint, bucket name) and zipped it again. Otherwise, everything should be there.

In this bundle, the failed restore is zeebe-restore.
[screenshot]

@reasonerjt reasonerjt self-assigned this Apr 1, 2022
@reasonerjt reasonerjt added Needs investigation and removed Needs info Waiting for information labels Apr 1, 2022
reasonerjt (Contributor) commented:

Found such messages in velero's log:

time="2022-03-31T09:02:41Z" level=info msg="Executing ChangeStorageClassAction" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:68" pluginName=velero restore=velero/zeebe-restore
time="2022-03-31T09:02:41Z" level=debug msg="Getting plugin config" cmd=/velero logSource="pkg/restore/change_storageclass_action.go:71" pluginName=velero restore=velero/zeebe-restore
time="2022-03-31T09:02:41Z" level=info msg="Done executing ChangeStorageClassAction" cmd=/velero logSource="/usr/local/go/src/runtime/panic.go:1038" pluginName=velero restore=velero/zeebe-restore

Did you set up a configmap to change the storage class, and does it have a data field?

https://velero.io/docs/v1.8/restore-reference/#changing-pvpvc-storage-classes
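
For context on what the question above is probing: per the linked docs, the plugin finds its config by labels and reads the data field as an old-name to new-name mapping. Below is a minimal client-go sketch (not the plugin's actual lookup code) that lists matching configmaps, assuming they live in the velero namespace and carry the labels shown later in this thread:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// assumes we run inside the cluster; a kubeconfig-based setup works too
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// the docs linked above say the plugin config is matched by these labels
	cms, err := client.CoreV1().ConfigMaps("velero").List(context.TODO(),
		metav1.ListOptions{
			LabelSelector: "velero.io/plugin-config,velero.io/change-storage-class=RestoreItemAction",
		})
	if err != nil {
		panic(err)
	}
	for _, cm := range cms.Items {
		// data is expected to map old storage class name -> new name;
		// an empty or missing data field is what the question above probes for
		fmt.Printf("%s: data=%v\n", cm.Name, cm.Data)
	}
}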

@reasonerjt reasonerjt added Needs info Waiting for information and removed Needs investigation labels Apr 5, 2022
son-la (Author) commented Apr 6, 2022

Yes, there's a configmap mapping the storage class from the source to the destination cluster. This extra configmap is deployed when installing Velero to the destination cluster:

configMaps:
  change-storage-class-config:
    labels:
      velero.io/plugin-config: ""  
      velero.io/change-storage-class: RestoreItemAction
    data:
      vmware-volume: vmware

reasonerjt (Contributor) commented:

@son-la
Thanks for the reply.
I think based on the logs, the nil pointer happens in this func:

func (a *ChangeStorageClassAction) Execute(input *velero.RestoreItemActionExecuteInput) (*velero.RestoreItemActionExecuteOutput, error) {

Unfortunately, the log does not contain the stack trace. I was thinking it happened in this line:

if config == nil || len(config.Data) == 0 {

Is the snippet from the output of kubectl? Based on the indentation, data should not be at the same level as labels. Could you double-check?

If this is not the problem, the best next step is to add more logging in the func to find where the nil pointer is thrown.
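
For what it's worth, one plausible way such a panic can arise (a hypothetical, self-contained sketch with stand-in types, not the actual plugin code): a StatefulSet's volumeClaimTemplates carry an optional storage class name, and dereferencing that optional field without a nil check produces exactly this kind of runtime error.

package main

import "fmt"

// pvcSpec is a stand-in for the Kubernetes PVC spec (an assumption for
// illustration, not the real type): storageClassName is optional in the
// API, so it is modeled as a pointer.
type pvcSpec struct {
	StorageClassName *string
}

func main() {
	mapping := map[string]string{"vmware-volume": "vmware"}
	spec := pvcSpec{} // a volumeClaimTemplate that omits storageClassName

	// guarded lookup: safe even when the field is absent
	if spec.StorageClassName != nil {
		if newSC, ok := mapping[*spec.StorageClassName]; ok {
			fmt.Println("would remap to", newSC)
		}
	}

	// unguarded dereference: panics with "invalid memory address or nil
	// pointer dereference", the same runtime error reported later in this
	// thread
	_ = mapping[*spec.StorageClassName]
}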

son-la (Author) commented Apr 7, 2022

Thanks for the troubleshooting effort.

The snippet is actually from the Helm values.yml file used when installing Velero.
[screenshot: values.yml snippet]

So does it mean that if I keep the storage class name the same, statefulsets.apps can be restored successfully?

@reasonerjt reasonerjt added the Helm Issues related to Helm charts label Apr 8, 2022
reasonerjt (Contributor) commented:

I don't use the Helm chart very much.

If you use kubectl to check the actual configmap, it will give us a better understanding of how it looks.

If you choose not to change the storage class, you will probably not hit the same nil pointer issue.

son-la (Author) commented Apr 13, 2022

Thanks for the answer. Here's the configmap created by the Helm installation:

➜ kubectl get configmap/velero-change-storage-class-config -n velero -o yaml
apiVersion: v1
data:
  vmware-volume: vmware
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: velero
    meta.helm.sh/release-namespace: velero
  creationTimestamp: "2022-04-13T07:03:39Z"
  labels:
    app.kubernetes.io/instance: velero
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: velero
    helm.sh/chart: velero-2.29.4
    velero.io/change-storage-class: RestoreItemAction
    velero.io/plugin-config: ""
  name: velero-change-storage-class-config
  namespace: velero
  resourceVersion: "387275"
  uid: 4c4e7501-b64f-40e9-afaa-a79615bdbd87

stale bot commented Jun 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the staled label Jun 12, 2022
stale bot commented Jun 27, 2022

Closing the stale issue.

@stale stale bot closed this as completed Jun 27, 2022
reasonerjt added a commit to reasonerjt/velero that referenced this issue Jul 10, 2022
Mitigate the issue mentioned in vmware-tanzu#4782
When there's a bug or misconfiguration that causes nil pointer there
will be more stack trace information to help us debug.

Signed-off-by: Daniel Jiang <[email protected]>
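
The mitigation this commit describes can be illustrated with a minimal sketch (an assumed shape, not the actual Velero plugin framework code): wrap the action in a deferred recover and log runtime/debug.Stack() so the panic's origin shows up in the log.

package main

import (
	"fmt"
	"runtime/debug"
)

// runAction stands in for invoking a restore item action that may panic
// (an assumption for illustration, not the real plugin interface).
func runAction(action func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			// include the stack trace so the log shows where the panic was thrown
			err = fmt.Errorf("plugin panicked: %v\n%s", r, debug.Stack())
		}
	}()
	return action()
}

func main() {
	err := runAction(func() error {
		var sc *string
		_ = *sc // deliberate nil dereference for demonstration
		return nil
	})
	fmt.Println(err)
}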
UristMcMiner commented:

Hi, this issue still persists in 1.9.0, and changing storage classes is a crucial feature for migrating applications to different storage.

stefanrepl commented Aug 16, 2022

I am also running into this. I ran it after the stack-trace-on-panic change was merged and see this error log when trying to restore my StatefulSet:
time="2022-08-10T21:55:54Z" level=error msg="Namespace default, resource restore error: error preparing statefulsets.apps/default/kotsadm-postgres: rpc error: code = Aborted desc = plugin panicked: runtime error: invalid memory address or nil pointer dereference" logSource="pkg/controller/restore_controller.go:504" restore=velero/instance-fwkh5.kotsadm

I am going through the same steps the OP posted, and have confirmed my StatefulSet is present in the backup that is created.

For what it is worth, I do not see this issue when testing with version 1.7.1, but I do see it in versions 1.8.1 and 1.9.0; that is all I have tested so far.

Can we reopen this issue? @reasonerjt

divolgin (Contributor) commented:

I have confirmed that deleting the change-storage-class-config configmap allows the restore to complete successfully. The problem must be in ChangeStorageClassAction, as pointed out earlier.

Tested with 1.9.0 and 1.9.1

@reasonerjt Please re-open this issue.

sseago (Collaborator) commented Aug 25, 2022

Reopening because it was never resolved (closed by the stale bot) and there's now a PR submitted to fix the issue.

danfengliu pushed a commit to danfengliu/velero that referenced this issue Sep 13, 2022