
Velero backups fail due to failing to connect to kopia repo: mmap error: cannot allocate memory #8502

kkavin opened this issue Dec 10, 2024 · 18 comments


kkavin commented Dec 10, 2024

What steps did you take and what happened:
We have installed Velero version 1.14.0. After upgrading to this version we are seeing the error below, and backups are failing for a few customers. We had not seen this type of issue before.

"level=error msg="pod volume backup failed: error to initialize data path: error to boost backup repository connection default--kopia error to connect backup repo: error to connect repo with storage: error to connect to repository: unable to create shared content manager:"

What did you expect to happen:
Backups need to be completed successfully

Environment:

  • Velero version (use velero version): v1.14.0
  • Velero features (use velero client config get features): v1.12.3
  • Kubernetes version (use kubectl version): v1.30.1
  • Kubernetes installer & version: v1.30.5-gke.1014003
  • Cloud provider or hardware configuration: GCP
  • OS (e.g. from /etc/os-release): Ubuntu 22.04

velero_pod.log
velero_describe.txt
velero_backup.log
bundle-2024-12-10-07-25-40.zip

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"

Gui13 commented Dec 10, 2024

We have the same error, and it happens when maintenance is in progress.

We narrowed the issue down to vm.max_map_count being too small; we are looking for a way to have Terraform set the limit a bit higher.

We are also hopeful that upgrading to 1.15.1 (once out, with the Azure Workload Identity fix) will settle these issues: 1.15 ships a newer version of Kopia with many performance fixes.
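
For reference, a minimal sketch of checking and raising vm.max_map_count on a node (the value 262144 and the sysctl.d file name are illustrative choices, not recommendations; on managed node pools such as GKE this would typically go through node system config or a privileged DaemonSet rather than SSH):

# check the current limit on the node
sysctl vm.max_map_count

# raise it for the running kernel
sudo sysctl -w vm.max_map_count=262144

# persist the setting across reboots
echo 'vm.max_map_count=262144' | sudo tee /etc/sysctl.d/99-kopia-mmap.conf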

@Lyndon-Li
Contributor

Thanks for the clue @Gui13; from the code, this may be related.

@Lyndon-Li
Contributor

@kkavin
This problem should be related to the excessive number of index blobs, and so should be related to your repo data.
To help me understand the scale of your repo data, please help collect the info mentioned in the two comments below:
#8469 (comment)
#8469 (comment)


kkavin commented Dec 11, 2024

Hi @Lyndon-Li,
Please find the details below.

We are facing an issue with one repository, so I have gathered details from another repository experiencing the same issue in a different cluster.

kopia-error

kopia_blob.txt
kopia_content.txt
kopia_deleted.txt
kopia_epoch.txt
kopia_index_list.json
kopia_maintenance.json
kopia_prefix.txt
kopia_repo_status.txt
kopia_snapshot.txt


Lyndon-Li commented Dec 11, 2024

@kkavin
Looks like repo maintenance has not fully completed for a long time; as a result, there are a large number of index blobs:

Epoch Started  2024-09-06 19:10:45 UTC
1 2024-09-06 22:02:29 UTC ... 2024-12-11 03:34:13 UTC, 64621 blobs, 29 MB, span 2285h31m44s

This number is too large for subsequent Kopia operations to handle.

Could you share more info of what happened to the maintenance?


msfrucht commented Dec 11, 2024

Kopia will warn about excessive index blobs past 1000. Kopia can still handle it with enough resources, but I've never seen a count that excessive, even with 100 million+ unique blocks identified by the splitter. That repo was still on the order of 1000-1500 index blobs after several maintenance iterations.

Something must have happened to maintenance, or it is not occurring at all. There's a decent chance maintenance is failing due to the number of existing index blobs.


kkavin commented Dec 12, 2024

Hi @Lyndon-Li
We created manual backups to verify whether the maintenance job would be triggered, but it was not created for this specific cluster.

@msfrucht
Contributor

@kkavin Maintenance is triggered on a periodic schedule. The schedule is in the BackupRepository object for each (BSL, namespace, datamover type: restic or kopia) combination, under spec.maintenanceFrequency.

In the Velero install namespace there should be a set of Job objects per repository, up to 3 if the default history settings haven't been changed. Even if the Pod failed to start for some reason, the Job objects should be there, and the reason why should be in the status of the maintenance jobs.

The Job objects will be labeled with the key "velero.io/repo-name".

kubectl get job -n <velero install namespace> -l velero.io/repo-name
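
A sketch of what that looks like, assuming Velero is installed in the velero namespace (the custom-columns output is just one convenient way to surface spec.maintenanceFrequency, and <maintain-job-name> is a placeholder):

# maintenance frequency per backup repository
kubectl -n velero get backuprepositories.velero.io \
  -o custom-columns=NAME:.metadata.name,FREQUENCY:.spec.maintenanceFrequency

# maintenance jobs, and the reason the latest one failed
kubectl -n velero get job -l velero.io/repo-name
kubectl -n velero describe job <maintain-job-name>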


kkavin commented Dec 13, 2024


msfrucht commented Dec 13, 2024

Unfortunately, this is kind of what I expected based on the description.

Some maintenance jobs are succeeding for smaller namespaces, but the job for the larger one is running out of resources, so its maintenance job never succeeds. With maintenance not succeeding, the index blob issue got worse and worse until it reached the current state.

16m         Normal    Completed              job/nginx-example-default-kopia-tcrn7-maintain-job-1734098639022      Job completed
9m53s       Warning   BackoffLimitExceeded   job/cluster-default-kopia-s5pt5-maintain-job-1734098644007               Job has reached the specified backoff limit
16m         Warning   BackoffLimitExceeded   job/cluster-default-kopia-s5pt5-maintain-job-1734098283165               Job has reached the specified backoff limit
33m         Warning   BackoffLimitExceeded   job/cluster-default-kopia-s5pt5-maintain-job-1734097245634               Job has reached the specified backoff limit
45m         Warning   BackoffLimitExceeded   job/cluster-default-kopia-s5pt5-maintain-job-1734096566964               Job has reached the specified backoff limit

The container is running out of memory:

fatal error: runtime: cannot allocate memory

The default behavior is BestEffort, meaning no requests and limits are set. If you haven't set them, you should probably set the memory request to the maximum available on your highest-memory node.

You can do this by editing the Velero deployment args list (or by setting these flags at install time); see the sketch after the list below. This will restart the Velero pod.

  • --maintenance-job-cpu-request X (default: 0)
  • --maintenance-job-mem-request X (default: 0)
  • --maintenance-job-cpu-limit X (default: 0)
  • --maintenance-job-mem-limit X (default: 0)
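
A minimal sketch of appending these flags with a JSON patch (the memory values are illustrative only, and the deployment and namespace names assume a default install):

# restarts the velero pod; subsequent maintenance jobs pick up the requests/limits
kubectl -n velero patch deployment velero --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--maintenance-job-mem-request=2Gi"},
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--maintenance-job-mem-limit=8Gi"}
]'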

If there is not sufficient memory to run the maintenance job on the cluster, you will have to add more memory, or temporarily allocate, say, a cloud cluster large enough to do so (install Velero there along with a copy of the BSL, the BSL secrets, and the BackupRepository object). That will trigger maintenance to run on the alternate cluster.

Some Kubernetes distributions have extensions that raise alerts in the GUI when these types of jobs fail, beyond the default Kubernetes Events. If you have that, using it would be a good idea given the consequences. Red Hat OpenShift has AlertManager; I don't know much about the others.


Gui13 commented Dec 13, 2024

Hello @msfrucht, your insights in this thread are a boon to me. We've been struggling with erroneous velero backups for months, and I've learnt a lot just by following your interventions here.

Our backups are very problematic: some of them complete, more of them fail completely, and most fail "partially". I think there are two reasons for this:

  • we have a large default namespace (about 50 nodes, ~600 pods) with lots of files in our PVCs. Velero uses a one-kopia-repository-per-namespace policy, which means every time we perform a snapshot we have to load LOTS of index files.
  • maintenance runs are not completing quickly enough (or are killed by OOM), so the problem compounds every time we add new snapshots, since the indexes accumulate. This leads to a worsening loop where each new backup prevents a maintenance run from executing, which leads to more indexes...

I'll try to modify our maintenance job limits to guarantee correct execution and will report here.

More importantly, I'm inching towards another solution to this issue: being able to have more than one Kopia repo per namespace.

For instance, if we could ask velero to find its "kopia repo name" from a label, we could split the repository sizes and have WAY leaner maintenance runs (at the cost of having more of them as individual jobs, but I expect the K8S scheduler to eat this for breakfast).
Do you think this would be doable for people with large namespaces?


msfrucht commented Dec 16, 2024

@Gui13

You can submit per-PVC or user-selectable scoping of Kopia repositories as an enhancement request, but right now a single kopia repo is shared per namespace per BSL.

You can always trigger a new full backup by specifying an alternate BSL and changing the existing Schedule. A new full backup will have a lower index count, which may work around the issue and get forward-going backups back on track. A full backup is not sufficient to figure out how many indexes will eventually be there; the only way to tell is to run backups and regular maintenance long enough for the number of snapshots and expirations to reach a steady state. And even that will change over time with the nature and size of the data.

A quick high-level overview: https://kopia.io/docs/advanced/architecture/

600 Pods and the PVC count aren't really what causes the problem, but they can be indicative.

Kopia does deduplication-based backup, meaning it identifies duplicate data (to some degree) and makes sure there is only one instance of each identified block. Blocks can be fixed size or identified through fingerprinting of variable-sized blocks. Velero sets Kopia to use the fingerprint-based Buzhash algorithm. Files below a certain size are not deduped because the processing time exceeds any useful gain.

With the blocks identified, a deduplication-based backup can basically be broken down into the following parts:

1. content - the identified data blocks
2. metadata - name, location, access modes, extended attributes of files - and the blocks used for each version of the backup
3. backup list - which points to which set of metadata to use
4. index - the set of files which locates the blocks found in the metadata in content

The advantage of grouping the PVCs of a namespace together is that it is more likely to successfully deduplicate larger amounts of data. Imagine a dozen copies of kubevirt virtualization VMs all running in the namespace. There's a good chance they're all running the same OS and level, so the PVCs have huge quantities of deduplication-eligible data.

The trigger for a large index cache is the number of identifiable blocks and the number of backups.

Restic, also used by Velero, does this in its own way, which locks the repository during backup, restore, and maintenance operations.

Kopia does not, which results in duplicative indexes to avoid that type of locking. The index inefficiency is dealt with by maintenance jobs. Above 1000 index blobs, Kopia starts issuing warnings. The largest repo I've dealt with had about 100 million+ individual blocks and ended up with roughly 1500 indexes after several maintenance iterations. That appears to be about as good as it gets for that repository.

Backup systems with higher resource requirements will often use a dedicated server and database to manage the index equivalent. Kopia is not that. If this is going to be a problem, you may want to consider using Velero for k8s objects only, by turning off data movement, and using something considerably closer to the underlying storage for backup of the volume data.

Velero also puts all kopia cache data on local storage via ephemeral volumes in 1.15 and below, which can cause issues in restricted ephemeral-storage environments - and those are not uncommon in large environments. Unlike the metadata and data caches, which default to a 5GB max, the size of the index cache is unbounded. The 100 million+ block environment I mentioned used 25-30GB of cache due to indexes.

@Gui13 You'll have to do some experimentation. Download the kopia cli to examine your repositories. They are located in the object storage at /kopia/. Browse your object repo if necessary to find it. This will show you a great deal of repo information.
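
A sketch of what that inspection can look like for a GCS-backed BSL (the bucket, namespace, and credentials file are placeholders; the password shown is Velero's default static repo password, so if you've customized it, take it from the velero-repo-credentials secret instead):

kopia repository connect gcs --bucket=<velero-bucket> --prefix=kopia/<namespace>/ \
  --credentials-file=<gcp-service-account.json> --password=static-passw0rd

kopia repository status        # splitter, format, and epoch manager settings
kopia index list | wc -l       # rough count of index blobs
kopia maintenance info         # maintenance owner, schedule, and last runs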


Lyndon-Li commented Dec 17, 2024

@kkavin @Gui13
Here is some information:

  1. The current problem is caused by missed repo maintenance. It doesn't happen in normal cases, but once it exists, it does not only affect repo maintenance.
  2. In all circumstances, you should keep your repo maintenance jobs healthy and running; otherwise, your repo performance will be significantly degraded or the repo may even deny service (the current issue is an example).
  3. The backup repository, as well as the other components of the data path, is resource-consuming, so you should ensure enough resources for your data scale; see the guidance at https://velero.io/docs/v1.15/performance-guidance/ (a patch sketch follows this list). Or you could use the recommended policy, BestEffort.
  4. We keep optimizing the repo maintenance process as well as other processes related to data scalability. Some issues have been fixed and some are still open.
  5. We have delivered what we have done so far in 1.15, so we recommend you upgrade to 1.15. In particular, memory usage is significantly reduced in some repo maintenance scenarios, which should help your case.
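
For item 3, if you prefer explicit requests/limits over BestEffort, a sketch of patching the node-agent DaemonSet (the container index and values are assumptions for a default install; size them for your own data scale per the guidance above):

kubectl -n velero patch daemonset node-agent --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/resources",
   "value": {"requests": {"cpu": "1", "memory": "4Gi"}, "limits": {"memory": "8Gi"}}}
]'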

@Lyndon-Li
Contributor

Let me keep this issue open to track the direct cause only: running out of memory mapping areas.

@Lyndon-Li Lyndon-Li changed the title from "Velero backups are failing after upgrading to velero 1.14.0 getting "level=error msg="pod volume backup failed: error to initialize data path: error to boost backup repository connection default--kopia: error to connect backup repo: error to connect repo with storage: error to connect to repository: unable to create shared content manager: error loading indexes: unable to open pack index \"xn37_ffef91c3ab1d666955be0b42d277e1d4-s316f373f808b25f312f-c1\": mmap error: cannot allocate memory" to "Velero backups fail due to failing to connect to kopia repo: mmap error: cannot allocate memory" Dec 17, 2024

kkavin commented Dec 17, 2024

Hi @Lyndon-Li @msfrucht
We have allocated more memory to the maintenance job, yet we are still seeing partially failed backups. We did not encounter this type of issue in previous versions. Could you explain why this issue occurs in version 1.14.0 and later? Will it be fixed, or do you have any suggestions or solutions to address it in version 1.14?

I have attached the updated maintenance job logs from after allocating the additional memory, and we no longer see the error message: "fatal error: runtime: cannot allocate memory."

trident.txt
maintain.txt


Lyndon-Li commented Dec 17, 2024

We have allocated more memory to the maintenance job, yet we are still facing partially failed backup issues.

As I mentioned, once this problem happens, it affects all repo activities, because it indicates the repo is in a very bad state. In order to fix the problem, you must run one or more successful full maintenance cycles.
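
If you cannot wait for Velero's next scheduled maintenance job, a one-off recovery sketch with the kopia CLI, connected to the repo as described earlier in this thread (taking ownership with --owner=me can interfere with Velero's own maintenance jobs, so treat this as a last resort):

kopia maintenance set --owner=me    # claim maintenance ownership for this connection
kopia maintenance run --full        # run a full maintenance cycle; may take a long time
kopia maintenance info              # verify the full cycle completed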


Lyndon-Li commented Dec 17, 2024

We did not encounter this type of issue in previous versions

This is because the repo only reached this state recently. Since you haven't run maintenance for a long time, things gradually got worse until the repo could no longer tolerate it and denied service.
Another factor is that from Velero 1.14/Kopia 0.17, some index management work has been moved into repo maintenance, so repo maintenance is even more important.


Lyndon-Li commented Dec 17, 2024

I have updated the maintenance job logs after allocating the additional memory, and we no longer see the error message

For your current problem, you need to wait for more than one completion of repo maintenance. For Velero 1.14/Kopia 0.17, only full maintenance is effective, because there is a bug where quick maintenance won't do some of the related tasks.
And even if you wait for maintenance to start fixing the current repo problem, it won't complete because, as mentioned above, the memory usage in Velero 1.14/Kopia 0.17 will be very high in your case.

Therefore, I suggest you upgrade to 1.15 and wait for several repo maintenance runs until the problem is fixed.
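
A sketch of how to confirm maintenance is completing again afterwards (assuming the default velero namespace; lastMaintenanceTime is reported in the BackupRepository status in recent Velero versions):

kubectl -n velero get backuprepositories.velero.io \
  -o custom-columns=NAME:.metadata.name,LAST_MAINTENANCE:.status.lastMaintenanceTime

# recent maintenance jobs should show successful completions
kubectl -n velero get job -l velero.io/repo-name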
