Runtime Panic with v1.6.6 #3706

cbowlby-bt · 2024-07-05T11:16:23Z

We recently upgraded from Argo Rollouts 1.4.x to 1.6.6 to see if we could resolve a few underlying panics that seem to be happening. However, we are still seeing a high number of panics whenever applications trigger an experiment, and we generally get the following log entry:

Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 362 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:149 +0x58
panic({0x2799ca0, 0x4785730})
	/usr/local/go/src/runtime/panic.go:884 +0x213
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).calculateWeightDestinationsFromExperiment(0xc00593b800)
	/go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:375 +0x27f
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcileTrafficRouting(0xc00593b800)
	/go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:198 +0x80f
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).rolloutCanary(0xc00593b800)
	/go/src/github.com/argoproj/argo-rollouts/rollout/canary.go:57 +0x1f6
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcile(0xc00593b800)
	/go/src/github.com/argoproj/argo-rollouts/rollout/context.go:86 +0xe7
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc000572380, {0x323e2c0, 0xc00021d590}, {0xc00579ccc0, 0x29})
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:430 +0x4d3
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:153 +0x89
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1({0x324bcb0?, 0xc00016a0e0}, {0x2c05345, 0x7}, 0xc002397e70, {0x323e2c0, 0xc00021d590}, 0xc0005f6540?, {0x2641800, 0xc0031a98a0})
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:157 +0x40b
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem({0x323e2c0, 0xc00021d590}, {0x324bcb0, 0xc00016a0e0}, {0x2c05345, 0x7}, 0x0?, 0xc00005c020?)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:171 +0xbf
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:351 +0xbe
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:157 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x32198a0, 0xc001468270}, 0x1, 0xc00047d920)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:158 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000bda7b0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:135 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x8e3e4a?, 0x0?, 0x0?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:92 +0x25
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:350 +0xa7
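
For readers less familiar with Go: the `runtime/debug.Stack()` and `panic(...)` frames at the top of the trace suggest the controller catches the panic in a deferred function and logs it rather than crashing. The following is only a minimal sketch of that general recover pattern. The names `rollout`, `syncHandler`, and `processItem` are hypothetical and are not taken from the Argo Rollouts source.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// rollout is a hypothetical stand-in type; it is not the Argo Rollouts API.
type rollout struct {
	name string
}

// syncHandler is a hypothetical handler that dereferences its argument,
// so it panics when it is handed a nil pointer.
func syncHandler(r *rollout) error {
	fmt.Println("syncing", r.name)
	return nil
}

// processItem runs syncHandler and converts any panic into an error that
// carries the stack trace, so the calling worker goroutine keeps running.
func processItem(r *rollout) (err error) {
	defer func() {
		if rec := recover(); rec != nil {
			err = fmt.Errorf("recovered from panic: %v\n%s", rec, debug.Stack())
		}
	}()
	return syncHandler(r)
}

func main() {
	// A nil rollout triggers the nil pointer dereference, which is recovered.
	if err := processItem(nil); err != nil {
		fmt.Println(err)
	}
	// A valid rollout is processed normally.
	_ = processItem(&rollout{name: "demo"})
}
```

With this pattern the process survives each panic, which would be consistent with the same log entry appearing many times without the controller restarting.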

We have a large mix of experiments in play, but the one that seems to trigger this most often is a basic placeholder experiment: a one-liner that just sends "quitquitquit" to the experiment and returns exit code 0. It's used during the initial development of an application, before being fleshed out with a full experiment and analysis during the final stages of development.

However, it's not the only case where we get those log entries; our fully fleshed-out experiments also seem to trigger them.

On top of that, it seems to happen far more frequently than we'd expect for a handful of deployments over the course of an hour or day: yesterday we had 699 log entries vs ~20 deployments.

Checklist:

I've included steps to reproduce the bug.
I've included the version of argo rollouts.

To Reproduce

Create a basic rollout that can consume the following analysis template:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: test-analysis
spec:
  metrics:
    - name: test-analysis
      provider:
        job:
          spec:
            template:
              spec:
                containers:
                  - name: exit-container
                    image: 'curlimages/curl:8.8.0'
                    command: [sh, -c, "echo 'sending quitquitquit' && curl -fsI -X POST http://localhost:15020/quitquitquit && exit 0"]
                restartPolicy: Never
            backoffLimit: 0

This is the most barebones experiment we see that can trigger this.

Expected behavior

The experiment should just exit cleanly and not trigger a panic, but more often than not the panic is triggered.

Version

v1.4.1
v1.6.6 (currently deployed)

Logs

# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.


chetan-rns · 2024-08-30T09:52:40Z

I think it's fixed in v1.7. PR that introduced the nil pointer check: #2734
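
As an illustration of what a nil pointer check of this kind can look like, here is a minimal sketch. It is not the code from the PR, and it makes no claim about which value was nil at trafficrouting.go:375. The types `experiment`, `templateStatus`, and `weightDestination` and the function `weightDestinations` are hypothetical stand-ins.

```go
package main

import "fmt"

// templateStatus, experiment, and weightDestination are hypothetical
// stand-ins; they do not mirror the real Argo Rollouts types.
type templateStatus struct {
	name        string
	serviceName string
}

type experiment struct {
	templateStatuses []templateStatus
}

type weightDestination struct {
	serviceName string
	weight      int32
}

// weightDestinations builds one destination per experiment template that
// has a configured weight. It returns nil instead of panicking when the
// experiment has not been created or looked up yet.
func weightDestinations(ex *experiment, weights map[string]int32) []weightDestination {
	if ex == nil {
		// Guard: without this check, ex.templateStatuses below would
		// dereference a nil pointer and panic.
		return nil
	}
	var out []weightDestination
	for _, ts := range ex.templateStatuses {
		w, ok := weights[ts.name]
		if !ok {
			continue // no weight configured for this template
		}
		out = append(out, weightDestination{serviceName: ts.serviceName, weight: w})
	}
	return out
}

func main() {
	// No experiment yet: the guard returns an empty result.
	fmt.Println(len(weightDestinations(nil, nil)))

	ex := &experiment{templateStatuses: []templateStatus{
		{name: "baseline", serviceName: "baseline-svc"},
	}}
	fmt.Println(weightDestinations(ex, map[string]int32{"baseline": 10}))
}
```

The design point is that the reconcile loop can treat "experiment not available yet" as "no extra destinations" and try again on the next sync, rather than crashing the handler.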

cbowlby-bt · 2024-08-30T15:00:11Z

@chetan-rns thanks, we'll keep an eye on it for a bit and see if it stops. I'll close after a few days if it seems squashed.

cbowlby-bt · 2024-09-03T10:39:14Z

The exceptions do seem to be cleared up, thank you, marking this closed.

cbowlby-bt added the bug label Jul 5, 2024

cbowlby-bt closed this as completed Sep 3, 2024
