Runtime Panic with v1.6.6 #3706

Closed
2 tasks done
cbowlby-bt opened this issue Jul 5, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@cbowlby-bt

cbowlby-bt commented Jul 5, 2024

We recently upgraded from Argo Rollouts 1.4.x to 1.6.6 to see whether that would resolve a few underlying panics. However, we still see a high number of panics whenever applications trigger an experiment, and we generally get the following log entry:

Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 362 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:149 +0x58
panic({0x2799ca0, 0x4785730})
	/usr/local/go/src/runtime/panic.go:884 +0x213
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).calculateWeightDestinationsFromExperiment(0xc00593b800)
	/go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:375 +0x27f
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcileTrafficRouting(0xc00593b800)
	/go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:198 +0x80f
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).rolloutCanary(0xc00593b800)
	/go/src/github.com/argoproj/argo-rollouts/rollout/canary.go:57 +0x1f6
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcile(0xc00593b800)
	/go/src/github.com/argoproj/argo-rollouts/rollout/context.go:86 +0xe7
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc000572380, {0x323e2c0, 0xc00021d590}, {0xc00579ccc0, 0x29})
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:430 +0x4d3
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:153 +0x89
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1({0x324bcb0?, 0xc00016a0e0}, {0x2c05345, 0x7}, 0xc002397e70, {0x323e2c0, 0xc00021d590}, 0xc0005f6540?, {0x2641800, 0xc0031a98a0})
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:157 +0x40b
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem({0x323e2c0, 0xc00021d590}, {0x324bcb0, 0xc00016a0e0}, {0x2c05345, 0x7}, 0x0?, 0xc00005c020?)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:171 +0xbf
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:351 +0xbe
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:157 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x32198a0, 0xc001468270}, 0x1, 0xc00047d920)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:158 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000bda7b0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:135 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x8e3e4a?, 0x0?, 0x0?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:92 +0x25
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:350 +0xa7
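
For readers unfamiliar with the "Recovered from panic" prefix: the top frames of the trace (`runtime/debug.Stack()` called from a deferred function in `processNextWorkItem`) suggest the worker wraps each sync in a `recover`, so a panic is logged and the controller keeps running. The following is a minimal sketch of that general Go pattern, not the argo-rollouts implementation; `item` and `runSafely` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// item is a hypothetical stand-in for an object a worker reconciles.
type item struct {
	weight *int32
}

// runSafely calls fn and converts a panic into an error that carries the
// stack trace, so one bad work item does not take down the whole worker.
func runSafely(fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v\n%s", r, debug.Stack())
		}
	}()
	return fn()
}

func main() {
	var it *item // nil on purpose, to provoke a nil pointer dereference

	err := runSafely(func() error {
		fmt.Println("weight:", *it.weight) // panics: it is nil
		return nil
	})

	// The process is still alive here; the panic surfaces as an error.
	fmt.Println("worker survived, error returned:", err != nil)
}
```

A pattern like this keeps the process alive but does not fix the underlying nil dereference, so the same object can panic again each time it is processed.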

We have a large mix of experiments in play, but the one that seems to trigger this most often is a basic placeholder experiment: a one-liner that sends "quitquitquit" to the experiment and returns exit code 0. It's used during the initial development of an application, before being fleshed out with a full experiment and analysis in the final stages of development.

However, it's not the only case where we get those log entries; our fully fleshed out experiments also seem to trigger them.

On top of that, it happens far more frequently than we'd expect for a handful of deployments over the course of an hour or day: yesterday we had 699 log entries vs ~20 deployments.

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

To Reproduce

Create a basic rollout that can consume the following analysis template:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: test-analysis
spec:
  metrics:
    - name: test-analysis
      provider:
        job:
          spec:
            template:
              spec:
                containers:
                  - name: exit-container
                    image: 'curlimages/curl:8.8.0'
                    command: [sh, -c, "echo 'sending quitquitquit' && curl -fsI -X POST http://localhost:15020/quitquitquit && exit 0"]
                restartPolicy: Never
            backoffLimit: 0

This is the most barebones experiment we see that can trigger this.

Expected behavior

The experiment should just exit cleanly and not trigger a panic, but more often than not the panic is triggered.

Version

v1.4.1
v1.6.6 (currently deployed)

Logs

# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@cbowlby-bt cbowlby-bt added the bug Something isn't working label Jul 5, 2024
@chetan-rns
Member

I think it's fixed in v1.7. PR that introduced the nil pointer check: #2734
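
For context, a nil pointer check of the kind mentioned here usually means guarding a pointer before dereferencing it while building the traffic weights. The sketch below illustrates the idea only; it is not the code from the referenced PR, and every type and function name in it (`experiment`, `templateStatus`, `weightDestination`, `weightDestinations`) is a hypothetical stand-in rather than an argo-rollouts type.

```go
package main

import "fmt"

// Hypothetical stand-ins; these are not argo-rollouts types.
type templateStatus struct {
	ServiceName string
	Weight      *int32 // may be unset
}

type experiment struct {
	Templates []templateStatus
}

type weightDestination struct {
	ServiceName string
	Weight      int32
}

// weightDestinations returns one destination per experiment template.
// It returns an empty slice when there is no experiment and skips
// templates without a weight, instead of dereferencing a nil pointer.
func weightDestinations(ex *experiment) []weightDestination {
	out := make([]weightDestination, 0)
	if ex == nil {
		return out
	}
	for _, t := range ex.Templates {
		if t.Weight == nil {
			continue
		}
		out = append(out, weightDestination{ServiceName: t.ServiceName, Weight: *t.Weight})
	}
	return out
}

func main() {
	w := int32(10)
	ex := &experiment{Templates: []templateStatus{
		{ServiceName: "baseline", Weight: &w},
		{ServiceName: "no-weight"},
	}}

	fmt.Println(len(weightDestinations(nil))) // no experiment: no destinations, no panic
	fmt.Println(len(weightDestinations(ex)))  // only the template with a weight is kept
}
```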

@cbowlby-bt
Author

@chetan-rns thanks, we'll keep an eye on it for a bit and see if it stops. I'll close after a few days if it seems squashed.

@cbowlby-bt
Author

The exceptions do seem to be cleared up, thank you, marking this closed.
