Rollout got stuck #2522

Closed
alonbehaim opened this issue Jan 18, 2023 · 40 comments · Fixed by #2689

Labels
bug Something isn't working

Comments

@alonbehaim

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Describe the bug
From time to time, when we deploy a new version and Argo Rollouts needs to roll out new pods, it just creates the new ReplicaSet and then gets stuck. Promoting the rollout manually or restarting the argo-rollouts controller fixes the issue.

To Reproduce
N/A

Expected behavior
The rollout should proceed with deploying the changes even if something else edited the rollout object.

Version
1.3.2

Logs
Attached are the full controller log and the log for the rollout called my-first-app:
full-controller-logs.csv
my-first-app-logs.csv


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@alonbehaim added the bug label on Jan 18, 2023
@zachaller
Collaborator

Two questions: first, are you running Argo Rollouts in HA mode, i.e. do you have two controllers running at once? Second, do you have any other controller or OPA-like agent that makes modifications to the rollout object? Based on the logs, the rollouts controller seems to be fighting with another controller.

@alonbehaim
Author

@zachaller we followed the recommendations from #1904 and scaled the controller down to 1 replica; this is the only Argo Rollouts controller running in the cluster. We have the same setup across multiple clusters and the issue reproduces in all of them.
We also validated, while it was stuck, that the last change was performed by the Argo Rollouts controller itself, and we validated that no other controller can edit the rollouts.
As part of our investigation we also tried passing the argument --rollout-threads=1, but it still happened.

We considered a few things that might be causing it; maybe you can shed some light on them:

  1. Is it possible that ArgoCD applies a change at the same moment the rollout starts applying its own change? For each deployment, spec.strategy.canary.analysis.args[4].value is changed in the rollout object; the rollout manifest is attached as rollout.yaml.
  2. Could it be related to the fact that Argo Rollouts updates the RS with canary/stable annotations and labels?
  3. We also see the following error when we're not rolling out a new version, and after a few moments it stops. Is that normal?
rollout syncHandler error: Operation cannot be fulfilled on replicasets.apps "automations-7bf7b986f4": the object has been modified; please apply your changes to the latest version and try again

@zachaller
Collaborator

The only time I have generally seen this is when there is some admission controller or something else fighting with the rollouts controller. If this happens every time and you are able to reproduce it consistently, I wonder if it is something related to that; in the #1904 case it ended up being Lacework. ReplicaSets can also be modified by other controllers. Rollouts will generally retry in those cases, as long as something is not always updating the rollout underneath it.

@alonbehaim
Author

alonbehaim commented Jan 18, 2023

@zachaller the issue is not consistent at all; it happens once in 10-20 deployments and unfortunately we can't reproduce it. Is there anything you suggest we check the next time it happens?

For question number 3: is it normal that we see this error many times during the day even when everything is working as expected?

@zachaller
Collaborator

zachaller commented Jan 18, 2023

For question number 3: is it normal that we see this error many times during the day even when everything is working as expected?

Yes, it can be normal, and rollouts should retry updating the RS, as seen here. What happened is that something updated the ReplicaSet; it could be an HPA or some other controller, even the built-in RS controller. The rollouts controller itself could also do it, but I think the controller is pretty good about not clobbering itself across threads.
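For readers hitting the same message: below is a minimal client-go sketch of why this conflict is usually harmless. This is not the argo-rollouts code; bumpReplicas and the namespace/name used in main are placeholders. Kubernetes updates use optimistic concurrency on resourceVersion, so the writer that loses the race gets a 409 Conflict and simply has to re-read the latest object and retry.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/retry"
)

// bumpReplicas re-reads the ReplicaSet on every attempt so its resourceVersion is
// current; if another writer (an HPA, the built-in RS controller, ...) got in between,
// the Update returns a conflict and RetryOnConflict simply runs the closure again.
func bumpReplicas(client kubernetes.Interface, namespace, name string, replicas int32) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		rs, err := client.AppsV1().ReplicaSets(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		rs.Spec.Replicas = &replicas
		_, err = client.AppsV1().ReplicaSets(namespace).Update(context.TODO(), rs, metav1.UpdateOptions{})
		return err // a conflict here triggers another attempt with a fresh Get
	})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	// Placeholder namespace/name taken from the error message quoted above.
	_ = bumpReplicas(kubernetes.NewForConfigOrDie(cfg), "default", "automations-7bf7b986f4", 3)
}
```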

@zachaller
Collaborator

zachaller commented Jan 18, 2023

Could I also have you try upgrading to 1.4? There were some changes to leader election that might fix the issue by automatically restarting the controller on problems such as Kubernetes API failures, which could maybe be the cause.
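For context, the restart-on-lost-lease behaviour referenced here typically looks like the standard client-go leader-election pattern below. This is a hedged sketch, not the actual argo-rollouts implementation; the lease name, namespace, and POD_NAME identity are made up. The key point is that OnStoppedLeading exits the process when the lease cannot be renewed (for example after repeated API failures), so Kubernetes restarts the pod instead of leaving a wedged controller running.

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		os.Exit(1)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Lease-based lock; name, namespace, and identity are placeholders.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-controller-lock", Namespace: "argo-rollouts"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// start the controller workers here
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// lost the lease (e.g. API failures): exit so the pod restarts
				// with fresh informer caches instead of staying wedged
				os.Exit(1)
			},
		},
	})
}
```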

@alonbehaim
Author

Thanks @zachaller, we will upgrade to v1.4 and check whether it solves the issue.

@zachaller
Collaborator

@alonbehaim did you see any improvement on v1.4?

@alonbehaim
Author

@zachaller unfortunately we see the same behavior with v1.4.
Attached are logs from today for the rollout my-app-2-transaction-recovery:
rollout_logs.csv

@alonbehaim
Author

@zachaller any idea how we can continue debugging this issue?

@zachaller
Collaborator

zachaller commented Feb 17, 2023

I still think this is something else modifying the rollout object outside of the rollouts controller. I have a possible patch that might get us more info and/or possibly fix the issue. It should not be needed in the case where no other controller modifies the rollout object, because even if we do get a conflict we eventually retry; this just triggers the retry sooner. Are you comfortable building this patch into a custom version and running it?

diff --git a/rollout/controller.go b/rollout/controller.go
index 5b2a4eb9..58d62c71 100644
--- a/rollout/controller.go
+++ b/rollout/controller.go
@@ -415,8 +415,14 @@ func (c *Controller) syncHandler(ctx context.Context, key string) error {
        err = roCtx.reconcile()
        if roCtx.newRollout != nil {
                c.writeBackToInformer(roCtx.newRollout)
+               logCtx.Warnf("newRollout written to informer '%s': %w", roCtx.newRollout.Name, err)
        }
        if err != nil {
+               if k8serrors.IsConflict(err) {
+                       logCtx.Warnf("Retrying updating of Rollout '%s': %w", roCtx.newRollout.Name, err)
+                       c.enqueueRolloutAfter(roCtx.newRollout, time.Second)
+                       return nil
+               }
                logCtx.Errorf("roCtx.reconcile err %v", err)
        }
        return err

@alonbehaim
Author

alonbehaim commented Feb 18, 2023

@zachaller thanks!
I created PR #2607 with the change so the tests will run as well. We will deploy it in the next few days and I'll let you know the result.

@alonbehaim
Author

@zachaller today we finished upgrading all the Argo Rollouts instances we have. So far it looks good, but the issue sometimes happened only once a week, so we still need to wait in order to confirm it solves the problem.

We see this new log line sometimes printed when there is no error:

newRollout written to informer 'my-app-1': <nil>

I think it would be more helpful to check whether there is an err before printing this log:

        err = roCtx.reconcile()
        if roCtx.newRollout != nil {
                c.writeBackToInformer(roCtx.newRollout)
+               if err != nil {
+                        logCtx.Warnf("newRollout written to informer '%s': %w", roCtx.newRollout.Name, err)
+               }
        }
        if err != nil {
+               if k8serrors.IsConflict(err) {
+                       logCtx.Warnf("Retrying updating of Rollout '%s': %w", roCtx.newRollout.Name, err)
+                       c.enqueueRolloutAfter(roCtx.newRollout, time.Second)
+                       return nil
+               }
                logCtx.Errorf("roCtx.reconcile err %v", err)
        }
        return err

wdyt?

@zachaller
Collaborator

@alonbehaim Yeah, I missed that; that's a good check to add to reduce noise.

@alonbehaim
Author

alonbehaim commented Feb 27, 2023

@zachaller yesterday we faced a hanging rollout; it happened after a new rollout started while the rollout was in the middle of the canary rollout of the previous deployment.
I saw this error all over the place; I guess it's related to the recent changes:

logs
Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 175387 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:149 +0x58
panic({0x21c24a0, 0x3c11ef0})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc0004aca80, {0x29d9ad0, 0xc000051800}, {0xc003a8b480, 0x34})
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:424 +0x6c8
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1()
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:153 +0x89
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1({0x29e5b40?, 0xc0000a2640}, {0x25899bc, 0x7}, 0xc006e9de70, {0x29d9ad0, 0xc000051800}, 0x10000c0005fb5b8?, {0x2094760, 0xc0056f63a0})
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:157 +0x40b
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem({0x29d9ad0, 0xc000051800}, {0x29e5b40, 0xc0000a2640}, {0x25899bc, 0x7}, 0x0?, 0xc0005fb720?)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:171 +0xbf
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:336 +0xbe
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0007581b0?)
	/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000760000?, {0x29b99e0, 0xc0052dc210}, 0x1, 0xc000648e40)
	/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000e8a7e0?, 0x3b9aca00, 0x0, 0x0?, 0x29c2b70?)
	/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0xc0005fb790?, 0xc000e8a7e0?, 0xc0005fb7a0?)
	/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:90 +0x25
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:335 +0xa7

full_logs.csv

@zachaller
Collaborator

zachaller commented Feb 27, 2023

Ahh, roCtx.newRollout was nil, which is valuable information because it lets us know that something in reconcile failed to create a newRollout object. We should, however, also try requeueing using the old roCtx.rollout instead of newRollout when we see that conflict.

@zachaller
Collaborator

One other good bit of info is the log line Error: updating rollout revision, which only happens in two spots in rollouts/sync.go; the one on line 106 is a bit interesting to me, and I will look into it a bit more as well.

@alonbehaim
Author

alonbehaim commented Mar 6, 2023

It happened again; I didn't find anything new there.
This time it happened when we tried to update an existing rollout that wasn't in the middle of another rollout.
new_logs.csv

@zachaller
Collaborator

Were the logs you just posted from a build requeueing roCtx.rollout and not roCtx.newRollout?

@alonbehaim
Author

Were the logs you just posted from a build requeueing roCtx.rollout and not roCtx.newRollout?

No, it's still using roCtx.newRollout.
I wonder: if I push the following change, will it stop getting stuck on the previous version and instead promote the new one?

        err = roCtx.reconcile()
        if roCtx.newRollout != nil {
                c.writeBackToInformer(roCtx.newRollout)
+               if err != nil {
+                        logCtx.Warnf("newRollout written to informer '%s': %w", roCtx.newRollout.Name, err)
+               }
        }
        if err != nil {
+		if k8serrors.IsConflict(err) {
+			logCtx.Warnf("Retrying updating of Rollout '%s': %v", roCtx.newRollout.Name, err)
+			if roCtx.newRollout != nil {
+				c.enqueueRolloutAfter(roCtx.newRollout, time.Second)
+			} else {
+				c.enqueueRolloutAfter(roCtx.rollout, time.Second)
+			}
			return nil
		}
                logCtx.Errorf("roCtx.reconcile err %v", err)
        }
        return err

@zachaller
Collaborator

c.enqueueRolloutAfter only uses the namespace/name of the rollout object, so using just roCtx.rollout should be good enough to cover both cases, but yes, requeueing with roCtx.rollout is worth testing.
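To illustrate why either object works: a minimal sketch that assumes the usual client-go pattern of keying the workqueue by namespace/name. This is not the actual enqueueRolloutAfter implementation, and a Pod stands in for a Rollout here.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// enqueueAfter reduces any object to its "namespace/name" key before queueing it,
// so an old and a new copy of the same object enqueue the exact same work item.
func enqueueAfter(q workqueue.RateLimitingInterface, obj interface{}, d time.Duration) {
	key, err := cache.MetaNamespaceKeyFunc(obj)
	if err != nil {
		return
	}
	q.AddAfter(key, d)
}

func main() {
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer q.ShutDown()

	old := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "my-first-app", ResourceVersion: "100"}}
	updated := old.DeepCopy()
	updated.ResourceVersion = "101"

	enqueueAfter(q, old, time.Second)
	enqueueAfter(q, updated, time.Second)

	// Both copies map to the same key, so they trigger the same sync.
	oldKey, _ := cache.MetaNamespaceKeyFunc(old)
	newKey, _ := cache.MetaNamespaceKeyFunc(updated)
	fmt.Println(oldKey, newKey, oldKey == newKey) // default/my-first-app default/my-first-app true
}
```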

@alonbehaim
Author

c.enqueueRolloutAfter only uses the namespace/name of the rollout object, so using just roCtx.rollout should be good enough to cover both cases, but yes, requeueing with roCtx.rollout is worth testing.

Great, so tomorrow I'll apply the following change, with the logs, so it will be easier to track what happened:

        err = roCtx.reconcile()
        if roCtx.newRollout != nil {
                c.writeBackToInformer(roCtx.newRollout)
+               if err != nil {
+                        logCtx.Warnf("newRollout written to informer '%s': %w", roCtx.newRollout.Name, err)
+               }
        }
        if err != nil {
+		if k8serrors.IsConflict(err) {
+			if roCtx.newRollout != nil {
+				logCtx.Warnf("Retrying updating of newRollout '%s': %v", roCtx.newRollout.Name, err)
+				c.enqueueRolloutAfter(roCtx.newRollout, time.Second)
+			} else {
+				logCtx.Warnf("Retrying updating of Rollout '%s': %v", roCtx.rollout.Name, err)
+				c.enqueueRolloutAfter(roCtx.rollout, time.Second)
+			}
			return nil
		}
                logCtx.Errorf("roCtx.reconcile err %v", err)
        }
        return err

@alonbehaim
Author

@zachaller we deployed the suggested fix from the previous comment in our production; unfortunately we hit the same hanging-rollout issue.
logs.csv

@alexef
Member

alexef commented Mar 31, 2023

we also ran into this today:

Mar 31, 2023 @ 11:24:53.009  E0331 09:24:53.009565 1 controller.go:174] Operation cannot be fulfilled on replicasets.apps "booking-assistant-79dc857c5d": the object has been modified; please apply your changes to the latest version and try again
Mar 31, 2023 @ 11:24:53.009  time="2023-03-31T09:24:53Z" level=error msg="roCtx.reconcile err Operation cannot be fulfilled on replicasets.apps \"booking-assistant-79dc857c5d\": the object has been modified; please apply your changes to the latest version and try again" generation=26 namespace=booking-assistant resourceVersion=1650552507 rollout=booking-assistant
Mar 31, 2023 @ 11:24:53.009  time="2023-03-31T09:24:53Z" level=error msg="rollout syncHandler error: Operation cannot be fulfilled on replicasets.apps \"booking-assistant-79dc857c5d\": the object has been modified; please apply your changes to the latest version and try again" namespace=booking-assistant rollout=booking-assistant
Mar 31, 2023 @ 11:24:52.872  E0331 09:24:52.872526 1 controller.go:174] Operation cannot be fulfilled on replicasets.apps "booking-assistant-79dc857c5d": the object has been modified; please apply your changes to the latest version and try again
Mar 31, 2023 @ 11:24:52.872  time="2023-03-31T09:24:52Z" level=error msg="roCtx.reconcile err Operation cannot be fulfilled on replicasets.apps \"booking-assistant-79dc857c5d\": the object has been modified; please apply your changes to the latest version and try again" generation=26 namespace=booking-assistant resourceVersion=1650552481 rollout=booking-assistant

A rollback/redeploy solved it. We are running the latest 1.4.

@zachaller
Collaborator

zachaller commented Mar 31, 2023

@alexef yours seems slightly different in that it is with the ReplicaSet, which can happen for a few reasons such as HPA scaling, but retries should eventually make that go through; it sounds like it got stuck instead. Did you also see a log where it failed on the rollout itself, i.e. roCtx.reconcile err Operation cannot be fulfilled on rollouts.argoproj.io?

And do you also see a log line from
logCtx.Infof("%s syncHandler queue retries: %v : key \"%v\"", objType, workqueue.NumRequeues(key), key)

@zachaller
Collaborator

I am wondering if this diff would fix it.

diff --git a/rollout/controller.go b/rollout/controller.go
index 1fa0c73c..29586ebf 100644
--- a/rollout/controller.go
+++ b/rollout/controller.go
@@ -413,13 +413,14 @@ func (c *Controller) syncHandler(ctx context.Context, key string) error {
        }
 
        err = roCtx.reconcile()
-       if roCtx.newRollout != nil {
-               c.writeBackToInformer(roCtx.newRollout)
-       }
        if err != nil {
                logCtx.Errorf("roCtx.reconcile err %v", err)
+               return err
        }
-       return err
+       if roCtx.newRollout != nil {
+               c.writeBackToInformer(roCtx.newRollout)
+       }
+       return nil
 }
 
 // writeBackToInformer writes a just recently updated Rollout back into the informer cache.

The resulting code looks like this:

        err = roCtx.reconcile()
	if err != nil {
		logCtx.Errorf("roCtx.reconcile err %v", err)
		return err
	}
	if roCtx.newRollout != nil {
		c.writeBackToInformer(roCtx.newRollout)
	}
	return nil

The theory behind this patch is this: we get an error from the rollout's reconcile loop, i.e. the roCtx.reconcile() call, and in the code currently on master we then go and update the informer cache manually as long as roCtx.newRollout != nil.

If we look at the function syncReplicaSetRevision, we call c.setRolloutRevision, which, if it completes (and I think it generally should), will set c.newRollout = updatedRollout. Then, back in syncReplicaSetRevision, we try to update the status with updatedRollout, err := c.argoprojclientset.ArgoprojV1alpha1().Rollouts(c.rollout.Namespace).UpdateStatus(ctx, c.rollout, metav1.UpdateOptions{}). This fails because something modified the rollout, which kicks us back to the code in the patch, where we now have a set roCtx.newRollout but also an error from the reconcile. The state we write into the informer cache is now bad because it comes from a failed update. I think this could possibly cause a loop; it depends a bit on how the informer cache gets refreshed.
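To make the stale-cache risk concrete, here is a minimal sketch that uses client-go's cache.Store directly, with a Pod standing in for a Rollout. It is not the argo-rollouts writeBackToInformer code, just an illustration of the failure mode described above: if a locally built object is written into the local store even though the corresponding API update failed, later syncs reason from state the API server never accepted.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Local store keyed by namespace/name, like an informer's indexer.
	store := cache.NewStore(cache.MetaNamespaceKeyFunc)

	// Object as delivered by the watch, i.e. what the API server actually has.
	served := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
		Namespace: "default", Name: "my-first-app", ResourceVersion: "100",
	}}
	_ = store.Add(served)

	// A reconcile builds an updated object locally (here: a status change); imagine the
	// corresponding UpdateStatus call then failed with a 409 conflict, so the server
	// never accepted this state.
	mutated := served.DeepCopy()
	mutated.Status.Phase = corev1.PodRunning
	_ = store.Update(mutated) // manual write-back into the local cache happens anyway

	obj, _, _ := store.GetByKey("default/my-first-app")
	// The cache now reports a status the API server never stored; until a fresh watch
	// event or relist overwrites it, the next sync reasons from this phantom state.
	fmt.Println("cached phase:", obj.(*corev1.Pod).Status.Phase)
}
```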

draft pr: #2689

@zachaller
Collaborator

@alonbehaim Would you be willing to try out the new code?

@zachaller
Collaborator

@alonbehaim any update?

@gdvalle

gdvalle commented Apr 10, 2023

@zachaller I tried the patch and was still met with a stuck rollout. We have a Gatekeeper agent performing some mutations, and Argo seems to reliably hang at least once a week.

@zachaller
Collaborator

@gdvalle This particular patch would not help the case of some external system messing with the rollout resource. We would probably have to do something else to help that case.

@triplewy

@zachaller So if I understand correctly, if we have an HPA that scales a rollout, it will come into conflict with the argo-rollouts controller? Our setup is simple: we have a rollout that can be scaled by an HPA, and we occasionally get the error: rollout syncHandler error: error updating replicaset : Operation cannot be fulfilled on replicasets.apps \"xxx": the object has been modified; please apply your changes to the latest version and try again" namespace=xxx rollout=xxx

@Laakso

Laakso commented May 2, 2024

@triplewy This seems to happen with us as well with the HPA.

@DimKanoute

DimKanoute commented May 27, 2024

Hey, coming here to raise this issue again and try to reopen the discussion. We face the exact same issue with an HPA.

@vadasambar

vadasambar commented Oct 4, 2024

I can confirm this problem happens in v1.7.1.
We are using rollouts with an HPA, but I don't think it is related to the HPA.
The rollouts controller seems to miss the new ReplicaSet update and never receives it until the pod is restarted.
[screenshot of controller logs]
... repeats until
[screenshot of controller logs]

@vadasambar

You can see the rollouts controller is fighting with itself ^

@vadasambar

/re-open

@vadasambar

@alonbehaim can we re-open this issue? I don't think it is fixed yet.

@zachaller
Collaborator

I am aware of this issue. There are other improvements coming in 1.8, as well as some threads in Slack where we are discussing the informer cache being stale and possible solutions that should make it into 1.8.

@vadasambar

@zachaller thank you for the update. Looking forward to 1.8. I did skim through #argo-rollouts on CNCF Slack and the community meeting docs but didn't find anything related to this (maybe I am looking in the wrong place or not looking closely enough).
