
Velero maintenance job failing with oomkilled and logger name="[index-blob-manager]" sublevel=error #8474

Open
syedabbas011 opened this issue Dec 3, 2024 · 12 comments

@syedabbas011

What steps did you take and what happened:
We installed Velero version 1.14.0 and started facing issues after a few days: the Velero maintenance job pod is failing and the Velero pod is restarting.

time="2024-11-26T23:51:33Z" level=warning msg="active indexes [xn0_000301012dec8243b5846445a770d7e9-s36cadf8751419dd612e-c1
deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
time="2024-11-26T23:52:00Z" level=warning msg="Found too many index blobs (2438), this may result in degraded performance.\n\nPlease ensure periodic repository maintenance is enabled or run 'kopia maintenance'." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[shared-manager]" sublevel=error
time="2024-11-26T23:52:00Z" level=info msg="Start to open repo for maintenance, allow index write on load" logSource="pkg/repository/udmrepo/kopialib/lib_repo.go:165"
time="2024-11-26T23:52:01Z" level=warning msg="active indexes [xn0_000301012dec8243b5846445a770d7e9-s36cadf8751419dd612e-c1 xn0_00280b34b036384ed96d5578acf4c6fb-se642334b6c6a9f9b12e-c1 xn0_0063062125fbddc7796399dc24e67ec9-s0ee97e381a38601b12f-c1 xn0_008deb665d149583be338e10fe647591-s690b6323aa26dd5012e-c1 xn0_0092e2fcacda8223a85b0a7699586ab4-se54981af127c1f4b12f-c1 xn0_00abc8ac6601e30814b8494501424606-se875ff6f29fdd0d412f-c1 xn0_00ade8ccce8aee4c4830601ee2e6ec10-s015f50b8b530ad2d12f-c1
Uploading bundle-2024-12-02-09-55-22.tar.gz…

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version): v1.14.0
  • Velero features (use velero client config get features): v1.12.3
  • Kubernetes version (use kubectl version): 1.28
  • Kubernetes installer & version: GKE 1.28.9-gke.1000000
  • Cloud provider or hardware configuration: GCP
  • OS (e.g. from /etc/os-release): Ubuntu 22.04

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Contributor

The mentioned errors with sublevel=error are expected and are not the cause of the current problem.

@Lyndon-Li
Contributor

Lyndon-Li commented Dec 3, 2024

The cause of this problem is that the memory usage exceeds the limit assigned to the Velero server pod where the maintenance is running.
Maintenance jobs are resource-consuming tasks; the default resource (CPU or memory) configuration may not fit your repository. If so, you can follow https://velero.io/docs/v1.14/customize-installation/ to increase the CPU/memory limit.
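
For example, a minimal sketch of raising the limits (the cpu=1 / memory=2Gi values are placeholders to tune for your repository size, and the --maintenance-job-* server flags are assumed from the v1.14 customize-installation docs, so verify them against your installed version):

  # Raise the Velero server pod's own CPU/memory limits
  kubectl -n velero set resources deployment/velero --limits=cpu=1,memory=2Gi

  # Optionally raise the limits applied to repository maintenance job pods by adding
  # server flags to the velero container args (assumed flag names; check your version's docs):
  #   --maintenance-job-cpu-limit=1
  #   --maintenance-job-mem-limit=2Gi
  kubectl -n velero edit deployment/velero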

@syedabbas011
Author

syedabbas011 commented Dec 4, 2024

Hi @Lyndon-Li,
After increasing the resources of the maintenance job, the pod got stuck for 18h; below are the error logs from the job pod.

level=warning msg="Found too many index blobs (3519), this may result in degraded performance.\n\nPlease ensure periodic repository maintenance is enabled or run 'kopia maintenance'." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[shared-manager]" sublevel=error
time="2024-12-03T09:57:25Z" level=info msg="Looking for active contents..." logModule=kopia/snapshotgc logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
time="2024-12-03T09:57:25Z" level=info msg="Looking for active contents..." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
time="2024-12-03T10:20:12Z" level=warning msg="unable to create memory-mapped segment: unable to create memory-mapped file: open : no such file or directory" logModule=kopia/bigmap logSource="pkg/kopia/kopia_log.go:96" logger name="[shared-manager]"
time="2024-12-03T10:20:12Z" level=warning msg="unable to create memory-mapped segment: unable to create memory-mapped file: open : no such file or directory" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:96" logger name="[shared-manager]"
time="2024-12-03T10:21:16Z" level=warning msg="unable to create memory-mapped segment: unable to create memory-mapped file: open : no such file or directory" logModule=kopia/bigmap logSource="pkg/kopia/kopia_log.go:96" logger name="[shared-manager]"
time="2024-12-03T10:21:16Z" level=warning msg="unable to create memory-mapped segment: unable to create memory-mapped file: open : no such file or directory" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:96" logger name="[shared-manager]"
time="2024-12-03T10:22:53Z" level=warning msg="unable to create memory-mapped segment: unable to create memory-mapped file: open : no such file or directory" logModule=kopia/bigmap logSource="pkg/kopia/kopia_log.go:96" logger name="[shared-manager]"

@Lyndon-Li
Contributor

the pod got stuck

What do you mean by stuck? Is the pod still in Running state?

@syedabbas011
Author

The maintenance job has been running for 19 hrs.

@Lyndon-Li
Contributor

Could you check the CPU and memory usage of the maintenance job pod?
If it is at a high level, you may try 1.15; there are several significant performance improvements in 1.15.

Additionally, please share how much data has been backed up to the repository.
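
For example, a quick way to check usage (assuming metrics-server is installed in the cluster; the pod name is a placeholder):

  # Current CPU/memory usage of all pods in the velero namespace
  kubectl -n velero top pod
  # Or a specific maintenance job pod
  kubectl -n velero top pod <maintenance-job-pod-name>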

@syedabbas011
Author

The total backed-up data is 763 GB.

@Lyndon-Li
Contributor

What is the size of most files in your backup?

@syedabbas011
Author

The main PVC has around 700 GB of data.

@Lyndon-Li
Contributor

What is the file size in the volume?

@syedabbas011
Author

How can I check the file sizes?

@kaovilai
Member

kaovilai commented Dec 5, 2024

Methods to Check File Sizes

Here's a breakdown of the most common and effective methods:

  1. kubectl exec

    • This is the most straightforward approach if you have a pod already running that utilizes the PVC.
    • Steps:
      1. Access the pod: kubectl exec -it <pod-name> -n <namespace> -- bash
      2. Navigate to the mount path: This is defined in your pod's YAML under volumeMounts.
      3. Use standard Linux commands: ls -l, du -sh *, etc. to check file sizes.
  2. Temporary Debug Pod

    • If you don't have an existing pod, you can create a temporary pod for debugging.
    • Steps:
      1. Create a pod definition: Specify the PVC you want to inspect in the volumes and volumeMounts sections. Use a simple image like busybox (a sketch of such a debug-pod.yaml is shown after this list).
      2. Deploy the pod: kubectl apply -f debug-pod.yaml
      3. Exec into the pod: As described in method 1, use kubectl exec to get a shell and inspect file sizes.
      4. Delete the debug pod: kubectl delete -f debug-pod.yaml
  3. kubectl debug (Kubernetes 1.18+)

    • This command offers a streamlined way to create an ephemeral container for debugging.  
    • Steps:
      1. Use kubectl debug: kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name> (replace <container-name> if your pod has multiple containers)
      2. This will launch a new container: You'll have a shell within the existing pod's environment, with access to the mounted PVC.
      3. Inspect file sizes: Use the same Linux commands as mentioned earlier.
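
For method 2, a minimal sketch of such a debug-pod.yaml (namespace, claimName, and the node name are placeholders to replace with your own values):

  apiVersion: v1
  kind: Pod
  metadata:
    name: pvc-debug
    namespace: default               # namespace where the PVC lives
  spec:
    restartPolicy: Never
    # nodeName: <node-name>          # for RWO volumes, pin to the node where the PVC is attached
    containers:
    - name: debug
      image: busybox
      command: ["sleep", "3600"]     # keep the pod alive long enough to inspect files
      volumeMounts:
      - name: data
        mountPath: /data
    volumes:
    - name: data
      persistentVolumeClaim:
        claimName: <pvc-name>        # the PVC backing the backed-up volume

  # Then inspect file sizes, e.g. the 20 largest entries:
  kubectl exec -it pvc-debug -- sh -c "du -a /data | sort -rn | head -n 20"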

Important Considerations

  • RWO Limitation: Remember that with RWO, the PVC is attached to a single node. Your debug pod needs to be scheduled on the same node to access the volume.

https://g.co/gemini/share/3362ed88622b

@ywk253100 added the "Needs info" (Waiting for information) label on Dec 9, 2024