
Velero maintenance job failing with oomkilled and logger name="[index-blob-manager]" sublevel=error #8474

Open
syedabbas011 opened this issue Dec 3, 2024 · 12 comments

@syedabbas011

What steps did you take and what happened:
We installed Velero version 1.14.0 and started facing issues after a few days: the Velero maintenance job pod is failing and the Velero pod is restarting.

time="2024-11-26T23:51:33Z" level=warning msg="active indexes [xn0_000301012dec8243b5846445a770d7e9-s36cadf8751419dd612e-c1
deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
time="2024-11-26T23:52:00Z" level=warning msg="Found too many index blobs (2438), this may result in degraded performance.\n\nPlease ensure periodic repository maintenance is enabled or run 'kopia maintenance'." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[shared-manager]" sublevel=error
time="2024-11-26T23:52:00Z" level=info msg="Start to open repo for maintenance, allow index write on load" logSource="pkg/repository/udmrepo/kopialib/lib_repo.go:165"
time="2024-11-26T23:52:01Z" level=warning msg="active indexes [xn0_000301012dec8243b5846445a770d7e9-s36cadf8751419dd612e-c1 xn0_00280b34b036384ed96d5578acf4c6fb-se642334b6c6a9f9b12e-c1 xn0_0063062125fbddc7796399dc24e67ec9-s0ee97e381a38601b12f-c1 xn0_008deb665d149583be338e10fe647591-s690b6323aa26dd5012e-c1 xn0_0092e2fcacda8223a85b0a7699586ab4-se54981af127c1f4b12f-c1 xn0_00abc8ac6601e30814b8494501424606-se875ff6f29fdd0d412f-c1 xn0_00ade8ccce8aee4c4830601ee2e6ec10-s015f50b8b530ad2d12f-c1
Uploading bundle-2024-12-02-09-55-22.tar.gz…

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version): v1.14.0
  • Velero features (use velero client config get features): v1.12.3
  • Kubernetes version (use kubectl version): 1.28
  • Kubernetes installer & version: GKE 1.28.9-gke.1000000
  • Cloud provider or hardware configuration: GCP
  • OS (e.g. from /etc/os-release): Ubuntu 22.04

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Contributor

The mentioned errors with sublevel=error are expected and are not the cause of the current problem.

@Lyndon-Li
Contributor

Lyndon-Li commented Dec 3, 2024

The cause of this problem is that the memory usage exceeds the limit assigned to the Velero server pod where the maintenance is running.
Maintenance jobs are resource-consuming tasks; the default resource (CPU or memory) configuration may not fit your repository. If so, you can follow https://velero.io/docs/v1.14/customize-installation/ to increase the CPU/memory limit.
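
For example, a minimal sketch of raising the limits (the cpu=1 / memory=2Gi values are placeholders to tune for your repository size, and the --maintenance-job-* server flags are assumed from the v1.14 customize-installation docs, so verify them against your installed version):

  # Raise the Velero server pod's own CPU/memory limits
  kubectl -n velero set resources deployment/velero --limits=cpu=1,memory=2Gi

  # Optionally raise the limits applied to repository maintenance job pods by adding
  # server flags to the velero container args (assumed flag names; check your version's docs):
  #   --maintenance-job-cpu-limit=1
  #   --maintenance-job-mem-limit=2Gi
  kubectl -n velero edit deployment/velero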

@syedabbas011
Author

syedabbas011 commented Dec 4, 2024

Hi @Lyndon-Li,
After increasing the resources of the maintenance job, the pod got stuck for 18h; below are the error logs from the job pod.

level=warning msg="Found too many index blobs (3519), this may result in degraded performance.\n\nPlease ensure periodic repository maintenance is enabled or run 'kopia maintenance'." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[shared-manager]" sublevel=error
time="2024-12-03T09:57:25Z" level=info msg="Looking for active contents..." logModule=kopia/snapshotgc logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
time="2024-12-03T09:57:25Z" level=info msg="Looking for active contents..." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
time="2024-12-03T10:20:12Z" level=warning msg="unable to create memory-mapped segment: unable to create memory-mapped file: open : no such file or directory" logModule=kopia/bigmap logSource="pkg/kopia/kopia_log.go:96" logger name="[shared-manager]"
time="2024-12-03T10:20:12Z" level=warning msg="unable to create memory-mapped segment: unable to create memory-mapped file: open : no such file or directory" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:96" logger name="[shared-manager]"
time="2024-12-03T10:21:16Z" level=warning msg="unable to create memory-mapped segment: unable to create memory-mapped file: open : no such file or directory" logModule=kopia/bigmap logSource="pkg/kopia/kopia_log.go:96" logger name="[shared-manager]"
time="2024-12-03T10:21:16Z" level=warning msg="unable to create memory-mapped segment: unable to create memory-mapped file: open : no such file or directory" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:96" logger name="[shared-manager]"
time="2024-12-03T10:22:53Z" level=warning msg="unable to create memory-mapped segment: unable to create memory-mapped file: open : no such file or directory" logModule=kopia/bigmap logSource="pkg/kopia/kopia_log.go:96" logger name="[shared-manager]"

@Lyndon-Li
Contributor

the pod got stuck

What do you mean by stuck? Is the pod still in Running state?

@syedabbas011
Author

The maintenance job has been running for 19 hrs.

@Lyndon-Li
Contributor

Could you check the CPU and memory usage of the maintenance job pod?
If it is at a high level, you may try 1.15; there are several significant performance improvements in 1.15.

Additionally, please share how much data has been backed up to the repository.
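
For example, a quick way to check usage (assuming metrics-server is installed in the cluster; the pod name is a placeholder):

  # Current CPU/memory usage of all pods in the velero namespace
  kubectl -n velero top pod
  # Or a specific maintenance job pod
  kubectl -n velero top pod <maintenance-job-pod-name>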

@syedabbas011
Author

The total backed-up data is 763 GB.

@Lyndon-Li
Contributor

What is the size of most files in your backup?

@syedabbas011
Author

The main PVC has around 700 GB of data.

@Lyndon-Li
Contributor

What is the file size in the volume?

@syedabbas011
Author

How can I check the file sizes?

@kaovilai
Member

kaovilai commented Dec 5, 2024

Methods to Check File Sizes

Here's a breakdown of the most common and effective methods:

  1. kubectl exec

    • This is the most straightforward approach if you have a pod already running that utilizes the PVC.
    • Steps:
      1. Access the pod: kubectl exec -it <pod-name> -n <namespace> -- bash
      2. Navigate to the mount path: This is defined in your pod's YAML under volumeMounts.
      3. Use standard Linux commands: ls -l, du -sh *, etc. to check file sizes.
  2. Temporary Debug Pod

    • If you don't have an existing pod, you can create a temporary pod for debugging.
    • Steps:
      1. Create a pod definition: Specify the PVC you want to inspect in the volumes and volumeMounts sections. Use a simple image like busybox (a sketch of such a debug-pod.yaml is shown after this list).
      2. Deploy the pod: kubectl apply -f debug-pod.yaml
      3. Exec into the pod: As described in method 1, use kubectl exec to get a shell and inspect file sizes.
      4. Delete the debug pod: kubectl delete -f debug-pod.yaml
  3. kubectl debug (Kubernetes 1.18+)

    • This command offers a streamlined way to create an ephemeral container for debugging.  
    • Steps:
      1. Use kubectl debug: kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name> (replace <container-name> if your pod has multiple containers)
      2. This will launch a new container: You'll have a shell within the existing pod's environment, with access to the mounted PVC.
      3. Inspect file sizes: Use the same Linux commands as mentioned earlier.
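
For method 2, a minimal sketch of such a debug-pod.yaml (namespace, claimName, and the node name are placeholders to replace with your own values):

  apiVersion: v1
  kind: Pod
  metadata:
    name: pvc-debug
    namespace: default               # namespace where the PVC lives
  spec:
    restartPolicy: Never
    # nodeName: <node-name>          # for RWO volumes, pin to the node where the PVC is attached
    containers:
    - name: debug
      image: busybox
      command: ["sleep", "3600"]     # keep the pod alive long enough to inspect files
      volumeMounts:
      - name: data
        mountPath: /data
    volumes:
    - name: data
      persistentVolumeClaim:
        claimName: <pvc-name>        # the PVC backing the backed-up volume

  # Then inspect file sizes, e.g. the 20 largest entries:
  kubectl exec -it pvc-debug -- sh -c "du -a /data | sort -rn | head -n 20"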

Important Considerations

  • RWO Limitation: Remember that with RWO, the PVC is attached to a single node. Your debug pod needs to be scheduled on the same node to access the volume.

https://g.co/gemini/share/3362ed88622b

@ywk253100 added the "Needs info" (Waiting for information) label on Dec 9, 2024