image-automation-controller CrashLoopBack >= 0.20.2 #339
Comments
@saltyblu thank you very much for reporting this. Would you be able to share with us the contents of the commands below:
And the versions of your other controllers via:
I noticed that you are not running on the latest version of the image automation controller. Would you be able to run against the latest version and see if the error persists please? |
I've upgraded the image-automation-controller back to the latest version, I guess.
|
The image controller is in CrashLoopBackOff:
Pod logs are still not very helpful:
|
Same problem. |
The image-automation-controller fails silently with: Readiness probe failed: Get "http://10.6.0.61:9440/readyz": dial tcp |
@saltyblu thank you very much for the information. Would you mind dialling up the log verbosity to That can be configured on the controller's deployment arguments with
It would be great if you could also provide the results of: |
I did;
The service fails directly after trying to update an image tag :( without any logs. We are using a GKE cluster and the GCP registry.
I've upgraded and downgraded the service during this time period; we can't leave the service like this because it blocks our development :) |
@saltyblu Thanks again for the quick reply. We are investigating this and still trying to reproduce. Can you please provide more details about your cluster? What version of Kubernetes are you running on? It seems that the controller is working properly, but the probes are misbehaving for some reason in your setup. Do you mind disabling both |
Kubernetes Version is: v1.22.7-gke.300 with 6 nodes 62 Flux Kustomizations |
Do you have any additional |
We have 3 in place but no new ones there:
|
I've also disabled the health and readiness probes. Directly after triggering an image-update-automation, the pod dies and gets recreated :( |
After some debugging, the container seems to fail silently somewhere below that code line . . . var signingEntity *openpgp.Entity After debugging the controller with delve, we are not able to pinpoint a specific line of code where the controller stops working. |
We are having an issue similar to this. Any time there is a NEW change that needs to be committed, the automation controller crashes and ends up in CrashLoopBackOff until we suspend automation.
This is the last line we get in our logs before the crash: https://github.com/fluxcd/image-automation-controller/blob/main/controllers/imageupdateautomation_controller.go#L340
We DO NOT make it to this line: https://github.com/fluxcd/image-automation-controller/blob/main/controllers/imageupdateautomation_controller.go#L371
We think the failure is occurring around here, since the crash only happens when there is a new change: https://github.com/fluxcd/image-automation-controller/blob/main/controllers/imageupdateautomation_controller.go#L378
Images
Env
Amount Additional Information based on previous comments
|
We have recently merged some fixes on The version below is working quite well with most Git providers when using SSH, with the exception of If you could test ahead of the release and confirm the fixes/improves your issue that would be great.
|
@pjbgf Github - I reviewed the draft PR, we will get an image spun up and report back. Thanks for the quick update! |
Hi, Currently we are using only Github Enterprise as git provider. Thanks for the updates |
We have used the updated image for half a day now and the issue does not seem to be solved. The controller is able to update 1 or 2 ImageAutomations, then after a random time period it fails silently again (that is, without any error logs). |
I can concur, I'm getting the exact same behavior. I also tried the @pjbgf container quay.io/paulinhu/image-automation-controller:selfheal@sha256:0c78d0ff5aec404c724b623e9d4ae00d5c0348320c397626900d1e073e1d7ac5 Same behavior. This was the result I saw
It's now cycling this last behavior, which is what I was seeing before changing image-automation-controller images |
We were seeing exactly this behaviour in one of our clusters, and it seemed like the pod wasn't able to burst high enough, so we increased the limits on the image-automation-controller to
and also increased the git source interval from 2m to 5m. We haven't had any crashes since then. NB this is only happening in the cluster that has multiple git sources, including "include" sources. The cluster that pulls everything from a single git source is fine. |
Just to offer a second test, I'm using a single repo cluster and upped my resources like @annaken . Still a lot of crashes
But the images are eventually updated. So the image-automation-controller is still working, just not stable. |
We tested the image-automation-controller in a newly created GKE cluster with one additional repo besides flux-system and it is still failing. In our case the controller often misses updates, so we downgraded the controller back to 0.20.1 for now. :( |
I added some tracing and built an image using the above method. It seems to be silently crashing in
Specifically, I added these traces around it:
I set the log level to trace, and the last log I got from it was as follows:
We never get to the "adding file" trace log line in the if, or to the additional traces I added after it. Of note, .editorconfig is the first file in an ls of the git repo. We have reverted to 0.20.1 as well, and that seems stable so far. |
I've tested the new image-automation-controller already and this did not solve the problem. The controller still crashes |
Some users are experiencing some segfault issues in Flux controllers which seem to be related to internal git2go state processed within a background thread. Due to the shift into Managed Transport only, the concerns around multi-threading from a libgit2 seem to be less problematic than otherwise. #339 https://github.com/libgit2/libgit2/blob/main/docs/threading.md Signed-off-by: Paulo Gomes <[email protected]>
Here's the release candidate image: |
For everybody else's benefit, two users experiencing this issue have already confirmed the RC version worked. So we will start working towards getting this merged. Based on the upstream dependencies this may take some time to make its way into the official releases. Please report any issues or confirm (:heavy_check_mark:) this is working for you. |
✔️ |
@pjbgf Thank you very much, Paulo. The RC version |
Looks good. The controller is still crashing sometimes, but that's maybe another topic.
|
@saltyblu thanks for sharing this. The log above looks like the end of a goroutine dump. Do you have access to the entire logs, which may provide further insights into the error? |
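For readers unfamiliar with the term, a goroutine dump is the list of stack traces for every goroutine in the process, which the Go runtime prints on a fatal error. The sketch below shows how such a dump is produced on demand with `runtime.Stack`; `allGoroutineStacks` is a hypothetical helper and is not part of the controller. The output begins with the calling goroutine, so only having the tail of a dump often hides the part that explains the crash.

```go
package main

import (
	"fmt"
	"runtime"
)

// allGoroutineStacks returns the stack traces of all goroutines in the
// process. runtime.Stack truncates output that does not fit, so the buffer
// is doubled until the dump fits with room to spare.
func allGoroutineStacks() []byte {
	buf := make([]byte, 64<<10)
	for {
		n := runtime.Stack(buf, true)
		if n < len(buf) {
			return buf[:n]
		}
		buf = make([]byte, 2*len(buf))
	}
}

func main() {
	// Start an extra goroutine so the dump has more than one entry.
	done := make(chan struct{})
	go func() { <-done }()

	fmt.Printf("%s", allGoroutineStacks())
	close(done)
}
```

When a pod has already restarted, the full dump from the crashed container can usually still be retrieved with `kubectl logs --previous` on that pod.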
Use of MUSL was a temporary solution to mitigate cross-platform issues while building openssl and libssh2. Since Unmanaged transport has been deprecated, openssl and libssh2 dependencies are no longer required and by extension MUSL. Enables libgit2 threadless support and provides a regression assurance for fluxcd/image-automation-controller#339. Signed-off-by: Paulo Gomes <[email protected]>
Issue
Since upgrading to 0.20.2 the Image-Update-Automation Controller stopped working,
flux reconcile image update
runs forever.
The image-update-controller fails periodically.
/readyz is not returning 200, but also provides no information.
The service provides no logs for the failures.
How To reproduce
In our case: update image-automation-controller to > 0.20.1
Trigger an image update