Rollout got stuck #2522
Comments
So, two questions: are you running Argo Rollouts in HA mode, i.e. do you have two controllers running at once? Second, do you have any other controller or OPA-like agent that modifies the rollout object? Based on the logs, the rollouts controller seems to be fighting with another controller.
@zachaller we followed the recommendations from #1904 and scaled the controller down to 1 replica; this is the only Argo Rollouts controller running in the cluster. We have the same setup across multiple clusters and this issue reproduces in all of them. We considered a few things that might cause it; maybe you can shed some light on them:
So the only time I have generally seen this is when there is some admission controller or something else fighting with the rollouts controller. If this happens every time and you can reproduce it consistently, I wonder if it is something related to that; in the #1904 case it ended up being Lacework. The ReplicaSets could also be being modified by other controllers. Rollouts will generally retry in those cases, as long as something is not always updating the rollout underneath it.
@zachaller the issue is not consistent at all; it happens once in 10-20 deployments and unfortunately we can't reproduce it. Is there anything you suggest we check the next time it happens? For question number 3, is it normal that we see this error many times during the day even though everything is fine and working as expected?
Yes, it can be normal, and rollouts should retry updating the RS as seen here. What happened is that something updated the ReplicaSet; it could be an HPA or some other controller, even the built-in RS controller. The rollouts controller itself could also do it, but I think the controller is pretty good about not clobbering itself across its own threads.
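For context, the usual client-go pattern for surviving those 409 conflicts looks roughly like the sketch below. This is a minimal illustration under my own assumptions (the scaleReplicaSet helper is hypothetical), not the actual Argo Rollouts code.

```go
// Minimal sketch of the standard client-go retry-on-conflict pattern.
// Illustrative only; not the Argo Rollouts implementation.
package conflictretry

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// scaleReplicaSet is a hypothetical helper: if another writer (an HPA, the
// built-in ReplicaSet controller, ...) updates the RS between our Get and
// Update, the apiserver returns 409 Conflict and RetryOnConflict re-runs the
// closure against a freshly fetched copy.
func scaleReplicaSet(ctx context.Context, cs kubernetes.Interface, ns, name string, replicas int32) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		rs, err := cs.AppsV1().ReplicaSets(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		rs.Spec.Replicas = &replicas
		_, err = cs.AppsV1().ReplicaSets(ns).Update(ctx, rs, metav1.UpdateOptions{})
		return err
	})
}
```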
Could I also have you try upgrading to 1.4? There were some changes to leader election that might actually fix the issue by auto-restarting the controller on problems such as k8s API failures, which could maybe be the cause.
Thanks @zachaller, we will upgrade to v1.4 and check if it solves the issue.
@alonbehaim did you see any improvement on v1.4?
@zachaller unfortunately we see the same behavior with v1.4 as well.
@zachaller any idea how we can continue debugging this issue?
So I still think this is something else modifying the rollout object outside of the rollouts controller. I have a possible patch that might get us some more info and/or possibly fix the issue. It should not be needed in the case where no other controller modifies the rollout object right away, because even if we do get a conflict we do eventually retry; this just triggers the retry to happen sooner. Are you comfortable building this patch into a custom version and running it?
@zachaller thanks!
@zachaller today we finished upgrading all the Argo Rollouts instances we have. So far it looks good, but since the issue sometimes happened only once a week, we still need to wait to confirm that it solves it. We see this new log sometimes printed when there is no error.
I think it would be more helpful to check if there is an err before printing this log.
wdyt?
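For illustration, the suggested guard would be roughly the helper below (hypothetical helper and names; not the real controller code):

```go
package conflictlog

import (
	log "github.com/sirupsen/logrus"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// logConflictRequeue captures the check suggested above: only emit the
// "requeuing due to conflict" message when there actually is an error and it
// is a 409 Conflict, so the log line stops showing up on successful syncs.
func logConflictRequeue(logCtx *log.Entry, err error) bool {
	if err == nil || !apierrors.IsConflict(err) {
		return false
	}
	logCtx.Infof("requeuing rollout due to update conflict: %v", err)
	return true
}
```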
@alonbehaim Yeah, I missed that; that is a good check to add to reduce noise.
@zachaller yesterday we faced a hanging rollout. It happened after a new rollout started while the rollout was in the middle of the canary rollout of the previous deployment. Logs:
Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 175387 [running]:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1()
/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:149 +0x58
panic({0x21c24a0, 0x3c11ef0})
/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc0004aca80, {0x29d9ad0, 0xc000051800}, {0xc003a8b480, 0x34})
/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:424 +0x6c8
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1()
/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:153 +0x89
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1({0x29e5b40?, 0xc0000a2640}, {0x25899bc, 0x7}, 0xc006e9de70, {0x29d9ad0, 0xc000051800}, 0x10000c0005fb5b8?, {0x2094760, 0xc0056f63a0})
/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:157 +0x40b
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem({0x29d9ad0, 0xc000051800}, {0x29e5b40, 0xc0000a2640}, {0x25899bc, 0x7}, 0x0?, 0xc0005fb720?)
/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:171 +0xbf
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:336 +0xbe
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0007581b0?)
/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000760000?, {0x29b99e0, 0xc0052dc210}, 0x1, 0xc000648e40)
/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000e8a7e0?, 0x3b9aca00, 0x0, 0x0?, 0x29c2b70?)
/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0xc0005fb790?, 0xc000e8a7e0?, 0xc0005fb7a0?)
/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:90 +0x25
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:335 +0xa7
ahh
Also, one other good bit of info is this log line:
Happened again; I didn't find anything new there.
The logs you just posted, were those with the requeueing?
No, it's still using
c.enqueueRolloutAfter only uses the namespace/name of the rollout object, so using just roCtx.rollout should be good enough to cover both cases, but yes, running using
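To illustrate the namespace/name point, here is a standalone example (not controller code; the object values are made up). Controller-style enqueue helpers typically reduce the object to its "namespace/name" key before adding it to the workqueue, so only the ObjectMeta matters:

```go
package main

import (
	"fmt"

	"github.com/argoproj/argo-rollouts/pkg/apis/rollouts/v1alpha1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Any Rollout value carrying the right namespace/name identifies the same
	// workqueue item, regardless of what else is set on the object.
	ro := &v1alpha1.Rollout{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "my-first-app"}}
	key, err := cache.MetaNamespaceKeyFunc(ro)
	if err != nil {
		panic(err)
	}
	fmt.Println(key) // prints "default/my-first-app"
}
```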
Great, so tomorrow I'll apply the following change with the logs so it will be easier to track what happened:
@zachaller we deployed the suggested fix from the previous comment in our production; unfortunately we faced the same hanging-rollout issue.
We also ran into this today:
Rollback/deploy solved it. Running the latest 1.4.
@alexef yours seems slightly different in that it is with the ReplicaSet, which can happen for a few reasons such as HPA scaling, but retries should make that eventually go through; it sounds like it got stuck instead. Did you also see a log like this where it failed on the rollout itself? And do you also see a log line from
I am wondering if this diff would fix it.
The output code would look like this:
The theory behind this patch is this: we get an error from the rollouts reconcile loop, which is this function call. If we go look at that function...
Draft PR: #2689
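A rough sketch of the requeue-on-conflict idea (this is not the code in #2689; the helper, key, and delay below are assumptions, built on standard client-go types):

```go
package main

import (
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/workqueue"
)

// requeueOnConflict is a hypothetical helper for the idea described above: if
// reconciliation failed with a 409 Conflict (another writer changed the
// rollout), push the key back onto the queue after a short delay instead of
// waiting for the normal retry/resync path.
func requeueOnConflict(q workqueue.RateLimitingInterface, key string, err error) {
	switch {
	case err == nil:
		return
	case apierrors.IsConflict(err):
		q.AddAfter(key, 100*time.Millisecond)
	default:
		q.AddRateLimited(key)
	}
}

func main() {
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer q.ShutDown()
	requeueOnConflict(q, "default/my-first-app", nil)
	fmt.Println("queue length:", q.Len())
}
```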
@alonbehaim Would you be willing to try out the new code?
@alonbehaim any update?
@zachaller I tried the patch and was still met with a stuck rollout. We have a Gatekeeper agent performing some mutations and Argo seems to reliably hang at least once a week.
@gdvalle This particular patch would not help the case of some external system messing with the rollout resource. We would probably have to do something else to help that case.
@zachaller So if I understand correctly, if we have an HPA that scales a rollout, this will come into conflict with the argo-rollouts controller? Our setup is simple: we have a rollout that can be scaled by an HPA, and we occasionally get the error:
@triplewy This seems to happen with us as well with the HPA.
Hey, coming here to raise that issue and try to open the discussion again. We face the exact same issue with HPA.
You can see the rollouts controller is fighting with itself ^
/re-open
@alonbehaim can we re-open this issue? I don't think it is fixed yet.
I am aware of this issue. There are other improvements coming in 1.8, as well as some threads in Slack where we are discussing the informer cache being stale and possible solutions that should make it into 1.8.
@zachaller thank you for the update. Looking forward to 1.8. I did skim through #argo-rollouts on the CNCF Slack and the community meeting docs but didn't find anything related to this (maybe I am looking in the wrong place or not looking closely enough).
Nvm, found some threads around this:
Checklist:
Describe the bug
From time to time we face an issue where, when we deploy a new version and Argo Rollouts needs to roll out new pods, it just creates the new RS and then gets stuck. Promoting the rollout manually or restarting the argo-rollouts controller fixes it.
To Reproduce
N/A
Expected behavior
The rollout proceeds with deploying the changes even if something else edited the rollout object.
Version
1.3.2
Logs
Attached are both the full controller log and the logs for the rollout called my-first-app:
full-controller-logs.csv
my-first-app-logs.csv
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.