Restoring a file system backup to a different cluster failed due to Kopia snapshot not found #8019

RaniaMidaoui · 2024-07-16T16:49:15Z

What steps did you take and what happened:

I am creating a file system backup of a particular namespace in a K8s cluster and restoring it to another cluster, but the Restore gets stuck in "In Progress" and fails after the timeout. (I am also backing up and restoring the Pod to which the volume is mounted, along with some Secrets and ConfigMaps.)

The backup is stored in an S3 bucket and I made sure that the same bucket is linked to the new cluster.

After investigating, I can see that for some reason, the PodVolumeRestore failed with the error:
data path restore failed: Failed to run Kopia restore: Unable to load snapshot 2e97d1c5b03468f979e3143149d46239: snapshot not found

What did you expect to happen:
Restore to complete without an issue.

The following information will help us better understand what's going on:

The error logs from the Velero pod and the node agents are the following:

velero-64d44bf455-zcq96 velero  time="2024-07-15T09:05:09Z" level=info msg="Found 95 backups in the backup location that do not exist in the cluster and need to be synced" backupLocation=velero/default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:136"
 
...
 
velero-64d44bf455-zcq96 velero  time="2024-07-15T09:05:09Z" level=info msg="Attempting to sync backup into cluster" backup=school-0000-backup-20240711220015 backupLocation=velero/default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:144"
 
....
 
velero-64d44bf455-zcq96 velero  time="2024-07-15T09:07:09Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:126"
time="2024-07-15T09:07:11Z" level=info msg="starting restore" logSource="pkg/controller/restore_controller.go:535" restore=velero/school-0000-restore-r6ktt
 
....
 
velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="No repository found, creating one" backupLocation=default logSource="pkg/repository/ensurer.go:89" repositoryType=kopia volumeNamespace=school-0000
 
...
 
velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="Initializing backup repository" backupRepo=velero/school-0000-default-kopia-8s97q logSource="pkg/controller/backup_repository_controller.go:216"

velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="Set matainenance according to repository suggestion" frequency=1h0m0s logSource="pkg/controller/backup_repository_controller.go:263"

velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="the managed fields for school-0000/ldap-main-0 is patched" logSource="pkg/restore/restore.go:1714" restore=velero/school-0000-restore-r6ktt
 
....
 
velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:29Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="pod volume restore failed: data path restore failed: Failed to run kopia restore: Unable to load snapshot 2e97d1c5b03468f979e3143149d46239: snapshot not found" logSource="pkg/restore/restore.go:1891" restore=velero/school-0000-restore-r6ktt

The BackupStorageLocation is Available

Anything else you would like to add:
Restoring the backup to the same cluster it was taken from works with no issues; this only happens when I restore to a different cluster.

Environment:

Velero version (use velero version):

Client:
	Version: v1.13.2
	Git commit: -
Server:
	Version: v1.13.0

Velero features (use velero client config get features):
features: <NOT SET>
Kubernetes version (use kubectl version):

Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.8

Lyndon-Li · 2024-07-17T03:05:51Z

Unable to load snapshot 2e97d1c5b03468f979e3143149d46239: snapshot not found

This means that the Kopia uploader could not find the snapshot in the object store location specified in the BSL. Please double-check the objects in the object store where the Kopia repository data is stored, as indicated by the BSL, and make sure the BSLs in the source and destination clusters point to the same object store location.
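A minimal sketch of that comparison, assuming the BSL from each cluster has been exported as JSON (for example with `kubectl -n velero get backupstoragelocation default -o json`) and loaded into a Python dict. The helper names `location_of` and `differing_fields` are hypothetical, and only the fields `spec.provider`, `spec.objectStorage.bucket`, `spec.objectStorage.prefix` and `spec.config.s3Url` are compared; other settings such as region or credentials are not covered here.

```python
from __future__ import annotations

from dataclasses import dataclass

# Fields compared between the two clusters (hypothetical selection).
COMPARED_FIELDS = ("provider", "bucket", "prefix", "s3_url")


@dataclass(frozen=True)
class BslLocation:
    provider: str
    bucket: str
    prefix: str
    s3_url: str


def location_of(bsl: dict) -> BslLocation:
    """Extract the object store location from a BackupStorageLocation dict."""
    spec = bsl.get("spec") or {}
    storage = spec.get("objectStorage") or {}
    config = spec.get("config") or {}
    return BslLocation(
        provider=spec.get("provider", ""),
        bucket=storage.get("bucket", ""),
        # Normalize slashes so "backups/" and "backups" compare equal.
        prefix=(storage.get("prefix") or "").strip("/"),
        s3_url=(config.get("s3Url") or "").rstrip("/"),
    )


def differing_fields(source_bsl: dict, target_bsl: dict) -> list[str]:
    """Return the names of location fields that differ between two BSLs."""
    source, target = location_of(source_bsl), location_of(target_bsl)
    return [
        name
        for name in COMPARED_FIELDS
        if getattr(source, name) != getattr(target, name)
    ]
```

An empty result from `differing_fields(source, target)` only says that these four fields match; it does not prove that both clusters reach the same data.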

Lyndon-Li · 2024-07-17T03:07:06Z

Restore is stuck in "In Progress" and it fails after timeout

If the error is Unable to load snapshot, the restore should fail immediately. Please share the entire debug bundle by running velero debug so that we can troubleshoot further.

RaniaMidaoui · 2024-07-17T09:45:29Z

@Lyndon-Li Thank you for your response, here is the bundle you requested:
bundle-2024-07-17-10-46-43.tar.gz

Another update: we checked with the Kopia CLI and we can't find the snapshot either, but the cluster is connected to the right backup bucket and the BackupStorageLocation is listed as Available.

Lyndon-Li · 2024-07-17T10:03:12Z

we checked with Kopia CLI and we can't find the snapshot either

Since you have connected to the kopia repo, could you run kopia repo status, kopia snapshot list --all, and kopia content stats, and share the outputs?

RaniaMidaoui · 2024-07-17T10:44:00Z

@Lyndon-Li sure.

$ kopia snapshot list --all

$ kopia repo status
Config file:         /Users/rania.midaoui/Library/Application Support/kopia/repository.config

Description:         Repository in S3: <our_url>
Hostname:            mbp-rania-midaoui
Username:            rania.midaoui
Read-only:           false
Format blob cache:   15m0s

Storage type:        s3
Storage capacity:    unbounded
Storage config:      {
                       "bucket": "de-instncs-0001-backup",
                       "prefix": "kopia/school-0031/",
                       "endpoint": "<endpoint>",
                       "accessKeyID": "<our_access_id>",
                       "secretAccessKey": "****************************************",
                       "sessionToken": ""
                     }

Unique ID:           <UID>
Hash:                <HASH>
Encryption:          AES256-GCM-HMAC-SHA256
Splitter:            DYNAMIC-4M-BUZHASH
Format version:      3
Content compression: true
Password changes:    true
Max pack length:     21 MB
Index Format:        v2

Epoch Manager:       enabled
Current Epoch: 0

Epoch refresh frequency: 20m0s
Epoch advance on:        20 blobs or 10.5 MB, minimum 24h0m0s
Epoch cleanup margin:    4h0m0s
Epoch checkpoint every:  7 epochs

$ kopia content stats
Count: 1
Total Bytes: 276 B
Average: 276 B
Histogram:

        0 between 0 B and 10 B (total 0 B)
        0 between 10 B and 100 B (total 0 B)
        1 between 100 B and 1 KB (total 304 B)
        0 between 1 KB and 10 KB (total 0 B)
        0 between 10 KB and 100 KB (total 0 B)
        0 between 100 KB and 1 MB (total 0 B)
        0 between 1 MB and 10 MB (total 0 B)
        0 between 10 MB and 100 MB (total 0 B)

Lyndon-Li · 2024-07-17T11:15:08Z

From the above output, the repo is empty.
If the restore in the source cluster works well, the repo data must be there, so most probably you are referring to the wrong location in the target cluster.

RaniaMidaoui · 2024-07-17T13:39:24Z

@Lyndon-Li I retried with a new backup and made sure to connect the right bucket to the cluster where I restore. I verified the BackupStorageLocation: it's the same as in the other cluster and it is reported as available. Even when I run velero backup get, I get the right backups.
The error is still the same.

And another thing: when I connect to the bucket and list Kopia snapshots, I still don't find anything; it's empty.

Lyndon-Li · 2024-07-22T05:04:53Z

when I connect to the bucket and list Kopia snapshots, I still don't find anything; it's empty.

What do you see in this bucket? Do you see a kopia prefix? If so, what do you see under the kopia prefix?

RaniaMidaoui · 2024-07-25T20:22:52Z

@Lyndon-Li Deleting all the contents of the backup bucket solved the issue, but that is not a good solution, just a temporary fix to keep implementing. We cannot do this in a production environment.
I don't know what exactly changed when we deleted the bucket contents; we didn't change anything else. Any ideas why this happened?

There is another error complaining about sync, similar to this one in this issue: kopia/kopia#1938
I don't know if it is related.

Any suggestions from your side as to what might have happened, or how to actually fix the issue?

Lyndon-Li · 2024-07-26T03:00:08Z

I don't think it is related to kopia issue 1938, because there is no such error in the log you shared.
From the log you shared, the connection succeeded but there is no data in the repo, as if the repo had been newly created in the target.

Therefore, I need some more info to understand what was happening, e.g., answers to the questions I asked in #8019 (comment)

RaniaMidaoui · 2024-07-26T07:46:29Z

@Lyndon-Li sure, I can see a Kopia folder inside the bucket. Inside it there were only some files whose names start with _log_*; they seem to be log files.

wkloucek · 2024-08-07T05:44:10Z

We encountered the snapshot not found error again last week, but when debugging in the beginning of this week, we couldn't reproduce it.

Maybe a short word about what we're doing: we're frequently switching between clusters while developing our backup / restore process. That means we have a source cluster where Velero is running and creating backups, and a target cluster where we do restores via Velero. The S3 bucket is only accessible by one Velero installation at a time (we can guarantee this because we use aws s3api put-bucket-policy with only one unique Principal). Maybe we trigger some weird caching effects while switching back and forth.
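For readers unfamiliar with that setup, the following is a rough sketch of what a bucket policy with a single Principal could look like, built as a Python dict that can be serialized and passed to `aws s3api put-bucket-policy --policy`. The actual policy used here is not shown in this thread, so the function name, the bucket name, the role ARN and the list of actions are all hypothetical.

```python
import json


def single_principal_policy(bucket: str, principal_arn: str) -> dict:
    """Build an S3 bucket policy that names exactly one principal.

    The action list is an assumption about what a backup tool needs;
    adjust it to the real requirements of the installation.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowSingleBackupPrincipal",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                ],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",      # bucket-level actions
                    f"arn:aws:s3:::{bucket}/*",    # object-level actions
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and role names, for illustration only.
    policy = single_principal_policy(
        "example-backup-bucket",
        "arn:aws:iam::123456789012:role/velero-source",
    )
    print(json.dumps(policy, indent=2))
```

Switching the bucket from the source to the target installation would then mean regenerating the policy with the other principal's ARN and applying it again.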

We'll stay attentive in case it occurs again.

What we've learnt during the debugging:

The snapshot ID referenced by Velero is actually the manifest ID of the Kopia snapshot, which can only be seen in the listing when you include this flag: kopia snapshot list --manifest-id --all
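A small sketch of how that lookup could be scripted. It assumes the listing is produced in JSON form (for example `kopia snapshot list --all --json`) and that each entry carries its manifest ID in an `id` field; both points are assumptions to verify against your Kopia version. The function name `has_manifest_id` is hypothetical.

```python
import json


def has_manifest_id(listing_json: str, manifest_id: str) -> bool:
    """Return True if the JSON snapshot listing contains the manifest ID.

    listing_json is expected to be a JSON array of snapshot manifests;
    empty or whitespace-only output is treated as an empty listing.
    """
    snapshots = json.loads(listing_json.strip() or "[]")
    return any(entry.get("id") == manifest_id for entry in snapshots)
```

For the failure in this issue, the ID to look for would be the one from the error message, 2e97d1c5b03468f979e3143149d46239.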

Lyndon-Li · 2024-08-07T05:55:19Z

Answer for all the related problems:

If you can see kopia repo data, e.g., running kopia snapshot list --all shows snapshots, or running kopia content stats shows that the repo is not empty, it may be a switchover problem. However, we don't expect this to happen and there is no known issue about it. Please collect the log bundle by running velero debug on both sites when the problem happens so that we can troubleshoot further.
If you cannot see any data in the kopia repo, it may indicate that the source and target sites are not referring to the same object store location, or that your object store location has been destroyed. You need to double-check your environment and see what happened.
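The two cases above can be sketched as a tiny triage helper. It assumes that the plain-text output of `kopia snapshot list --all` is empty when the repository holds no snapshots, as it was in the output shared earlier in this thread; the names `Diagnosis` and `triage` are hypothetical.

```python
from enum import Enum


class Diagnosis(Enum):
    REPO_HAS_DATA = (
        "Repository contains snapshots: possibly a switchover problem. "
        "Collect `velero debug` bundles on both sites."
    )
    REPO_EMPTY = (
        "Repository looks empty: the source and target may not refer to the "
        "same object store location, or the location has been destroyed."
    )


def triage(snapshot_list_output: str) -> Diagnosis:
    """Classify a repository from the text output of the snapshot listing."""
    # Any non-blank line is taken as evidence that snapshots exist.
    has_snapshots = any(
        line.strip() for line in snapshot_list_output.splitlines()
    )
    return Diagnosis.REPO_HAS_DATA if has_snapshots else Diagnosis.REPO_EMPTY
```

This only automates the first branching step; either outcome still needs the follow-up checks described above.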

github-actions · 2024-10-07T02:03:08Z

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

github-actions · 2024-10-22T02:02:12Z

This issue was closed because it has been stalled for 14 days with no activity.

blackpiglet added area/fs-backup Kopia labels Jul 17, 2024

ywk253100 assigned Lyndon-Li Jul 22, 2024

Lyndon-Li mentioned this issue Aug 7, 2024

Don't try to create a new repo in the backup storage when BSL is readonly #8091

Open

github-actions bot added the staled label Oct 7, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 22, 2024
