Restoring a file system backup to a different cluster failed due to Kopia snapshot not found #8019

Closed
RaniaMidaoui opened this issue Jul 16, 2024 · 15 comments

Comments

@RaniaMidaoui

What steps did you take and what happened:

I am creating a file system backup of a particular namespace in a K8s cluster and restoring it to another cluster. However, the Restore is stuck in "In Progress" and it fails after timeout (I am also backing up and restoring the Pod to which the volume is mounted, along with some Secrets and ConfigMaps).

The backup is stored in an S3 bucket and I made sure that the same bucket is linked to the new cluster.

After investigating, I can see that for some reason, the PodVolumeRestore failed with the error:
data path restore failed: Failed to run Kopia restore: Unable to load snapshot 2e97d1c5b03468f979e3143149d46239: snapshot not found

What did you expect to happen:
Restore to complete without an issue.

The following information will help us better understand what's going on:

  • The Velero pod and node agent log errors are the following:
velero-64d44bf455-zcq96 velero  time="2024-07-15T09:05:09Z" level=info msg="Found 95 backups in the backup location that do not exist in the cluster and need to be synced" backupLocation=velero/default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:136"
 
...
 
velero-64d44bf455-zcq96 velero  time="2024-07-15T09:05:09Z" level=info msg="Attempting to sync backup into cluster" backup=school-0000-backup-20240711220015 backupLocation=velero/default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:144"
 
....
 
velero-64d44bf455-zcq96 velero  time="2024-07-15T09:07:09Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:126"
time="2024-07-15T09:07:11Z" level=info msg="starting restore" logSource="pkg/controller/restore_controller.go:535" restore=velero/school-0000-restore-r6ktt
 
....
 
velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="No repository found, creating one" backupLocation=default logSource="pkg/repository/ensurer.go:89" repositoryType=kopia volumeNamespace=school-0000
 
...
 
velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="Initializing backup repository" backupRepo=velero/school-0000-default-kopia-8s97q logSource="pkg/controller/backup_repository_controller.go:216"

velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="Set matainenance according to repository suggestion" frequency=1h0m0s logSource="pkg/controller/backup_repository_controller.go:263"

velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="the managed fields for school-0000/ldap-main-0 is patched" logSource="pkg/restore/restore.go:1714" restore=velero/school-0000-restore-r6ktt
 
....
 
velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:29Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="pod volume restore failed: data path restore failed: Failed to run kopia restore: Unable to load snapshot 2e97d1c5b03468f979e3143149d46239: snapshot not found" logSource="pkg/restore/restore.go:1891" restore=velero/school-0000-restore-r6ktt
  • The BackupStorageLocation is Available
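
When scanning a longer log for this failure, the affected snapshot IDs can be pulled out mechanically. The following is a minimal sketch: the pattern is copied from the error text quoted in this issue, and the function name `missing_snapshot_ids` is hypothetical. Other Velero versions may word the message differently.

```python
from __future__ import annotations

import re

# Pattern taken from the error text quoted in this issue (assumption: the
# wording is stable across the log lines being scanned).
SNAPSHOT_NOT_FOUND = re.compile(
    r"Unable to load snapshot (?P<id>[0-9a-f]+): snapshot not found"
)


def missing_snapshot_ids(log_text: str) -> list[str]:
    """Return distinct snapshot IDs reported as not found, in first-seen order."""
    seen: dict[str, None] = {}
    for match in SNAPSHOT_NOT_FOUND.finditer(log_text):
        # A dict keeps insertion order and drops duplicates.
        seen.setdefault(match.group("id"), None)
    return list(seen)
```

The input can be the saved output of the Velero pod logs or a file from the debug bundle; the function only reads text and does not contact the cluster.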

Anything else you would like to add:
Restoring the backup to the same cluster it was taken from works with no issues; this only happens when I restore to a different cluster.

Environment:

  • Velero version (use velero version):
Client:
	Version: v1.13.2
	Git commit: -
Server:
	Version: v1.13.0
  • Velero features (use velero client config get features):
    features: <NOT SET>

  • Kubernetes version (use kubectl version):

Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.8
@Lyndon-Li
Contributor

Unable to load snapshot 2e97d1c5b03468f979e3143149d46239: snapshot not found

This means that the Kopia uploader could not find the snapshot in the object store location specified in the BSL. Please double-check the objects in the object store where the Kopia repository data is stored, as indicated by the BSL, and make sure the BSLs in the source and destination clusters point to the same object store location.
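
One way to make this comparison less error-prone is to diff the fields of the two BackupStorageLocation objects that identify where data lives. The sketch below assumes each BSL has been exported as JSON (for example with kubectl get backupstoragelocation default -n velero -o json) and loaded into a dict. The `region` and `s3Url` config keys are those of the AWS object store plugin, which is an assumption about this setup; the function names are hypothetical.

```python
from __future__ import annotations


def bsl_location(bsl: dict) -> tuple:
    """Reduce a BackupStorageLocation object to the fields that locate its data."""
    spec = bsl.get("spec", {})
    storage = spec.get("objectStorage", {})
    config = spec.get("config") or {}
    return (
        spec.get("provider"),
        storage.get("bucket"),
        # Treat a missing prefix and differing slashes at the ends as equivalent.
        (storage.get("prefix") or "").strip("/"),
        config.get("region"),
        (config.get("s3Url") or "").rstrip("/"),
    )


def location_differences(source: dict, target: dict) -> list[str]:
    """List the location fields on which the two BSLs disagree."""
    names = ("provider", "bucket", "prefix", "region", "s3Url")
    return [
        f"{name}: source={a!r} target={b!r}"
        for name, a, b in zip(names, bsl_location(source), bsl_location(target))
        if a != b
    ]
```

An empty result only shows that the two specs agree on these fields; it does not prove that both clusters reach the same physical bucket (for instance, if the endpoints resolve differently).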

@Lyndon-Li
Contributor

Restore is stuck in "In Progress" and it fails after timeout

If the error is Unable to load snapshot, it should fail immediately. Please share the entire debug bundle by running velero debug so that we can troubleshoot further.

@RaniaMidaoui
Author

@Lyndon-Li Thank you for your response, here is the bundle you requested:
bundle-2024-07-17-10-46-43.tar.gz

Another update: we checked with Kopia CLI and we can't find the snapshot either, but the cluster is connected to the right backup bucket, and the BackupStorageLocation is listed as Available.

@Lyndon-Li
Contributor

we checked with Kopia CLI and we can't find the snapshot either

Since you have connected to the kopia repo, could you run kopia repo status, kopia snapshot list --all, and kopia content stats, and share the outputs?

@RaniaMidaoui
Author

RaniaMidaoui commented Jul 17, 2024

@Lyndon-Li sure.

$ kopia snapshot list --all
$ kopia repo status
Config file:         /Users/rania.midaoui/Library/Application Support/kopia/repository.config

Description:         Repository in S3: <our_url>
Hostname:            mbp-rania-midaoui
Username:            rania.midaoui
Read-only:           false
Format blob cache:   15m0s

Storage type:        s3
Storage capacity:    unbounded
Storage config:      {
                       "bucket": "de-instncs-0001-backup",
                       "prefix": "kopia/school-0031/",
                       "endpoint": "<endpoint>",
                       "accessKeyID": "<our_access_id>",
                       "secretAccessKey": "****************************************",
                       "sessionToken": ""
                     }

Unique ID:           <UID>
Hash:                <HASH>
Encryption:          AES256-GCM-HMAC-SHA256
Splitter:            DYNAMIC-4M-BUZHASH
Format version:      3
Content compression: true
Password changes:    true
Max pack length:     21 MB
Index Format:        v2

Epoch Manager:       enabled
Current Epoch: 0

Epoch refresh frequency: 20m0s
Epoch advance on:        20 blobs or 10.5 MB, minimum 24h0m0s
Epoch cleanup margin:    4h0m0s
Epoch checkpoint every:  7 epochs
$ kopia content stats
Count: 1
Total Bytes: 276 B
Average: 276 B
Histogram:

        0 between 0 B and 10 B (total 0 B)
        0 between 10 B and 100 B (total 0 B)
        1 between 100 B and 1 KB (total 304 B)
        0 between 1 KB and 10 KB (total 0 B)
        0 between 10 KB and 100 KB (total 0 B)
        0 between 100 KB and 1 MB (total 0 B)
        0 between 1 MB and 10 MB (total 0 B)
        0 between 10 MB and 100 MB (total 0 B)
$

@Lyndon-Li
Contributor

From the above output, the repo is empty.
If the restore in the source cluster works well, the repo data must be there, so most probably you are referring to the wrong location in the target cluster.
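
A quick check on "wrong location" is whether the prefix the Kopia CLI was connected to matches the prefix expected for the namespace being restored. The sketch below assumes the layout `<BSL prefix>/kopia/<volume namespace>/`, inferred from the "prefix" value in the kopia repo status output in this thread; the function names are hypothetical. Note that the status output shows the prefix kopia/school-0031/, while the restore log refers to volumeNamespace=school-0000; the thread does not say whether those came from the same attempt.

```python
def expected_kopia_prefix(bsl_prefix: str, volume_namespace: str) -> str:
    """Build the object-store prefix assumed for a namespace's Kopia repository.

    Assumption: the layout is <BSL prefix>/kopia/<namespace>/, as suggested by
    the repo status output quoted in this issue.
    """
    parts = [p for p in (bsl_prefix.strip("/"), "kopia", volume_namespace) if p]
    return "/".join(parts) + "/"


def prefix_matches(status_prefix: str, bsl_prefix: str, volume_namespace: str) -> bool:
    """Compare the prefix reported by `kopia repo status` with the expected one."""
    normalized = status_prefix.strip("/") + "/"
    return normalized == expected_kopia_prefix(bsl_prefix, volume_namespace)
```

For example, `prefix_matches("kopia/school-0031/", "", "school-0000")` is false, which would mean the CLI was inspecting a different repository than the one the restore reads from.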

@RaniaMidaoui
Author

RaniaMidaoui commented Jul 17, 2024

@Lyndon-Li I retried with a new backup and made sure to connect the right bucket to the cluster where I restore. I verified the BackupStorageLocation: it's the same as in the other cluster and it is reported as available. Even when I run velero backup get, I get the right backups.
The error is still the same.

And another thing: when I connect to the bucket and list Kopia snapshots, I still don't find anything; it's empty.

@Lyndon-Li
Contributor

when I connect to the bucket and list Kopia snapshots, I still don't find anything; it's empty.

What do you see in this bucket? Do you see a kopia prefix? If so, what do you see under the kopia prefix?

@RaniaMidaoui
Author

@Lyndon-Li Deleting all the contents of the backup bucket solved the issue, but that is not a good solution, just a temporary fix so that we can keep implementing. We cannot do this in a production environment.
I don't know what exactly changed when we deleted the bucket contents; we didn't change anything else. Any ideas why this happened?

There is another error complaining about sync, similar to this one in this issue: kopia/kopia#1938
I don't know if it is related.

Any suggestions as to what might have happened, or how to actually fix the issue from your side?

@Lyndon-Li
Contributor

I don't think it is related to kopia issue 1938, because there is no error in the log you shared.
From the log you shared, the connection succeeded, but there is no data in the repo, as if the repo had been newly created in the target.

Therefore, I need some more info to understand what happened, e.g., answers to the questions I asked in #8019 (comment)

@RaniaMidaoui
Author

RaniaMidaoui commented Jul 26, 2024

@Lyndon-Li sure. I can see a Kopia folder inside the bucket; inside it there were only some files whose names start with _log_*, which seem to be log files.

@wkloucek

wkloucek commented Aug 7, 2024

We encountered the snapshot not found error again last week, but when debugging at the beginning of this week, we couldn't reproduce it.

Maybe a short word about what we're doing: we're heavily switching between clusters for our backup / restore process development. This means we have a source cluster where Velero is running and creating backups, and a target cluster where we do restores via Velero. The S3 bucket is only accessible by one Velero installation at a time (we can guarantee this because we use aws s3api put-bucket-policy with only one unique Principal). Maybe we trigger some weird caching effects during this switching back and forth.

We'll stay attentive in case it occurs again.

What we've learnt during the debugging:

  • The snapshot ID referenced by Velero is actually the manifest ID of the Kopia snapshot, which can only be seen in the listing when you include this flag: kopia snapshot list --manifest-id --all
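
To check whether a given Velero snapshot ID appears in such a listing, a plain text search over the captured output is enough. This is a minimal sketch: it makes no assumption about the column layout of the listing, and the function name `lines_with_manifest` is hypothetical.

```python
from __future__ import annotations


def lines_with_manifest(listing: str, manifest_id: str) -> list[str]:
    """Return the lines of a captured snapshot listing that mention manifest_id.

    `listing` is expected to be the saved text output of
    `kopia snapshot list --manifest-id --all`.
    """
    return [line.strip() for line in listing.splitlines() if manifest_id in line]
```

An empty result for the ID from the Velero error (2e97d1c5b03468f979e3143149d46239 in this issue) would be consistent with the repository at that location not containing the snapshot.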

@Lyndon-Li
Contributor

Lyndon-Li commented Aug 7, 2024

Answer for all the related problems:

  • If you can see kopia repo data, e.g., when running kopia snapshot list --all you do see snapshots, or when running kopia content stats you see that the repo is not empty, it may be a switchover problem. However, we don't expect this to happen and there is no known issue about it. Please collect the log bundle by running velero debug on both sites when the problem happens so that we can troubleshoot further.
  • If you cannot see any data in the kopia repo, it may indicate that the source and target sites are not referring to the same object store location, or that your object store location has been destroyed. You need to double-check your environment and see what happened.
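
The two cases above can be sketched as a small triage helper over the captured CLI output. The "Count:" line format is taken from the kopia content stats output posted earlier in this thread. Treating a content count of 1 or less as empty is an assumption, based only on that same output being judged empty; the function names and return labels are hypothetical.

```python
from __future__ import annotations

import re


def content_count(stats_output: str) -> int | None:
    """Extract the value of the 'Count:' line from `kopia content stats` output."""
    match = re.search(r"^Count:\s*(\d+)\s*$", stats_output, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def triage(snapshot_listing: str, stats_output: str, empty_threshold: int = 1) -> str:
    """Classify a repository as 'repo-has-data' or 'repo-empty'.

    Assumption: any non-blank line in the snapshot listing counts as a
    snapshot, and a content count above empty_threshold counts as data.
    """
    has_snapshots = any(line.strip() for line in snapshot_listing.splitlines())
    count = content_count(stats_output)
    has_contents = count is not None and count > empty_threshold
    if has_snapshots or has_contents:
        # First case: possible switchover problem, collect velero debug on both sites.
        return "repo-has-data"
    # Second case: check that both sites use the same, intact object store location.
    return "repo-empty"
```

With the output posted in this thread (an empty snapshot listing and Count: 1), the helper returns "repo-empty", matching the diagnosis given above.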


github-actions bot commented Oct 7, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@github-actions github-actions bot added the staled label Oct 7, 2024

This issue was closed because it has been stalled for 14 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 22, 2024