Rollout got stuck #2522
Comments
So, two questions: are you running Argo Rollouts in HA mode, i.e. do you have two controllers running at once? Second, do you have any other controller or OPA-like agent that modifies the rollout object? Based on the logs, the rollouts controller seems to be fighting with another controller.
@zachaller we followed the recommendations from #1904 and scaled the controller down to 1 replica; this is the only Argo Rollouts controller running in the cluster. We have the same setup across multiple clusters and this issue reproduces in all of them. We considered a few things that might cause it; maybe you can shed some light on them:
So the only time I have generally seen this is when there is some admission controller or something else fighting with the rollouts controller. If this happens every time and you can reproduce it consistently, I wonder if it is something related to that; in the #1904 case it ended up being Lacework. The ReplicaSets could also be being modified by other controllers. Rollouts will generally retry in those cases, as long as something is not always updating the rollout underneath it.
@zachaller the issue is not consistent at all; it happens once in 10-20 deployments and unfortunately we can't reproduce it. Is there anything you suggest we check the next time it happens? For question number 3, is it normal that we see this error many times during the day even though everything is fine and working as expected?
Yes, it can be normal, and rollouts should retry updating the RS as seen here. What happened is that something updated the ReplicaSet; it could be an HPA or some other controller, even the built-in RS controller. The rollouts controller itself could also do it, but I think the controller is pretty good about not clobbering itself across its own threads.
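For context, the usual client-go pattern for surviving those 409 conflicts looks roughly like the sketch below. This is a minimal illustration under my own assumptions (the scaleReplicaSet helper is hypothetical), not the actual Argo Rollouts code.

```go
// Minimal sketch of the standard client-go retry-on-conflict pattern.
// Illustrative only; not the Argo Rollouts implementation.
package conflictretry

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// scaleReplicaSet is a hypothetical helper: if another writer (an HPA, the
// built-in ReplicaSet controller, ...) updates the RS between our Get and
// Update, the apiserver returns 409 Conflict and RetryOnConflict re-runs the
// closure against a freshly fetched copy.
func scaleReplicaSet(ctx context.Context, cs kubernetes.Interface, ns, name string, replicas int32) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		rs, err := cs.AppsV1().ReplicaSets(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		rs.Spec.Replicas = &replicas
		_, err = cs.AppsV1().ReplicaSets(ns).Update(ctx, rs, metav1.UpdateOptions{})
		return err
	})
}
```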
Could I also have you try upgrading to 1.4? There were some changes to leader election that might actually fix the issue by auto-restarting the controller on problems such as k8s API failures, which could maybe be the cause.
Thanks @zachaller, we will upgrade to v1.4 and check if it solves the issue.
@alonbehaim did you see any improvement on v1.4?
@zachaller unfortunately we see the same behavior with v1.4 as well.
@zachaller any idea how we can continue debugging this issue?
So I still think this is something else modifying the rollout object outside of the rollouts controller. I have a possible patch that might get us some more info and/or possibly fix the issue. It should not be needed in the case where no other controller modifies the rollout object right away, because even if we do get a conflict we do eventually retry; this just triggers the retry to happen sooner. Are you comfortable building this patch into a custom version and running it?
@zachaller thanks!
@zachaller today we finished upgrading all the Argo Rollouts instances we have. So far it looks good, but since the issue sometimes happened only once a week, we still need to wait to confirm that it solves it. We see this new log sometimes printed when there is no error.
I think it would be more helpful to check if there is an err before printing this log.
wdyt?
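For illustration, the suggested guard would be roughly the helper below (hypothetical helper and names; not the real controller code):

```go
package conflictlog

import (
	log "github.com/sirupsen/logrus"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// logConflictRequeue captures the check suggested above: only emit the
// "requeuing due to conflict" message when there actually is an error and it
// is a 409 Conflict, so the log line stops showing up on successful syncs.
func logConflictRequeue(logCtx *log.Entry, err error) bool {
	if err == nil || !apierrors.IsConflict(err) {
		return false
	}
	logCtx.Infof("requeuing rollout due to update conflict: %v", err)
	return true
}
```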
@alonbehaim Yeah, I missed that; that is a good check to add to reduce noise.
@zachaller yesterday we faced a hanging rollout. It happened after a new rollout started while the rollout was in the middle of the canary rollout of the previous deployment. Logs:
Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 175387 [running]:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1()
/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:149 +0x58
panic({0x21c24a0, 0x3c11ef0})
/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc0004aca80, {0x29d9ad0, 0xc000051800}, {0xc003a8b480, 0x34})
/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:424 +0x6c8
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1()
/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:153 +0x89
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1({0x29e5b40?, 0xc0000a2640}, {0x25899bc, 0x7}, 0xc006e9de70, {0x29d9ad0, 0xc000051800}, 0x10000c0005fb5b8?, {0x2094760, 0xc0056f63a0})
/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:157 +0x40b
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem({0x29d9ad0, 0xc000051800}, {0x29e5b40, 0xc0000a2640}, {0x25899bc, 0x7}, 0x0?, 0xc0005fb720?)
/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:171 +0xbf
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:336 +0xbe
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0007581b0?)
/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000760000?, {0x29b99e0, 0xc0052dc210}, 0x1, 0xc000648e40)
/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000e8a7e0?, 0x3b9aca00, 0x0, 0x0?, 0x29c2b70?)
/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0xc0005fb790?, 0xc000e8a7e0?, 0xc0005fb7a0?)
/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:90 +0x25
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:335 +0xa7
ahh
Also, one other good bit of info is this log line:
Happened again; I didn't find anything new there.
The logs you just posted, were those with the requeueing?
No, it's still using
c.enqueueRolloutAfter only uses the namespace/name of the rollout object, so using just roCtx.rollout should be good enough to cover both cases, but yes, running using
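To illustrate the namespace/name point, here is a standalone example (not controller code; the object values are made up). Controller-style enqueue helpers typically reduce the object to its "namespace/name" key before adding it to the workqueue, so only the ObjectMeta matters:

```go
package main

import (
	"fmt"

	"github.com/argoproj/argo-rollouts/pkg/apis/rollouts/v1alpha1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Any Rollout value carrying the right namespace/name identifies the same
	// workqueue item, regardless of what else is set on the object.
	ro := &v1alpha1.Rollout{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "my-first-app"}}
	key, err := cache.MetaNamespaceKeyFunc(ro)
	if err != nil {
		panic(err)
	}
	fmt.Println(key) // prints "default/my-first-app"
}
```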
Great, so tomorrow I'll apply the following change with the logs so it will be easier to track what happened:
@zachaller we deployed the suggested fix from the previous comment in our production; unfortunately we faced the same hanging-rollout issue.
We also ran into this today:
Rollback/deploy solved it. Running the latest 1.4.
@alexef yours seems slightly different in that it is with the ReplicaSet, which can happen for a few reasons such as HPA scaling, but retries should make that eventually go through; it sounds like it got stuck instead. Did you also see a log like this where it failed on the rollout itself? And do you also see a log line from
I am wondering if this diff would fix it.
The output code would look like this:
The theory behind this patch is this: we get an error from the rollouts reconcile loop, which is this function call. If we go look at that function...
Draft PR: #2689
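A rough sketch of the requeue-on-conflict idea (this is not the code in #2689; the helper, key, and delay below are assumptions, built on standard client-go types):

```go
package main

import (
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/workqueue"
)

// requeueOnConflict is a hypothetical helper for the idea described above: if
// reconciliation failed with a 409 Conflict (another writer changed the
// rollout), push the key back onto the queue after a short delay instead of
// waiting for the normal retry/resync path.
func requeueOnConflict(q workqueue.RateLimitingInterface, key string, err error) {
	switch {
	case err == nil:
		return
	case apierrors.IsConflict(err):
		q.AddAfter(key, 100*time.Millisecond)
	default:
		q.AddRateLimited(key)
	}
}

func main() {
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer q.ShutDown()
	requeueOnConflict(q, "default/my-first-app", nil)
	fmt.Println("queue length:", q.Len())
}
```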
@alonbehaim Would you be willing to try out the new code?
@alonbehaim any update?
@zachaller I tried the patch and was still met with a stuck rollout. We have a Gatekeeper agent performing some mutations and Argo seems to reliably hang at least once a week.
@gdvalle This particular patch would not help the case of some external system messing with the rollout resource. We would probably have to do something else to help that case.
@zachaller So if I understand correctly, if we have an HPA that scales a rollout, this will come into conflict with the argo-rollouts controller? Our setup is simple: we have a rollout that can be scaled by an HPA, and we occasionally get the error:
@triplewy This seems to happen with us as well with the HPA.
Hey, coming here to raise that issue and try to open the discussion again. We face the exact same issue with HPA.
You can see the rollouts controller is fighting with itself ^
/re-open
@alonbehaim can we re-open this issue? I don't think it is fixed yet.
I am aware of this issue. There are other improvements coming in 1.8, as well as some threads in Slack where we are discussing the informer cache being stale and possible solutions that should make it into 1.8.
@zachaller thank you for the update. Looking forward to 1.8. I did skim through #argo-rollouts on the CNCF Slack and the community meeting docs but didn't find anything related to this (maybe I am looking in the wrong place or not looking closely enough).
Nvm, found some threads around this:
Checklist:
Describe the bug
From time to time we face an issue where, when we deploy a new version and Argo Rollouts needs to roll out new pods, it just creates the new RS and then gets stuck. Promoting the rollout manually or restarting the argo-rollouts controller fixes it.
To Reproduce
N/A
Expected behavior
The rollout proceeds with deploying the changes even if something else edited the rollout object.
Version
1.3.2
Logs
Attached are both the full controller log and the logs for the rollout called my-first-app:
full-controller-logs.csv
my-first-app-logs.csv
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.