"unable to clone: SSH could not read data: Error waiting on socket" after updating to v0.19 #293
Comments
Can you trace the application while it is in a "locked state", as I described here: #209 (comment)? In case of concerns about sensitive data, you can send it to |
I can send it to you, but could you please describe what type of sensitive data is there? I hope there is no dump of k8s secrets or something in there :) |
It tells us what the application is doing at the moment the snapshot is taken. Nothing sensitive directly, and you can check the contents yourself using Looking at a 10s snapshot I took myself however, it may be more useful to have a 30 or 60 second version (by changing the GET parameter value in the URL). |
Thanks for the details, I have sent you the 60-second traces. Do you by any chance have an idea what we can do in the meantime (any workaround)? |
I have started to see this error from time to time as well (don't know if it is relevant or not to this case, but for the sake of completeness will post it here):
|
I have now downgraded to 0.15.0 and it seems to work fine again! I don't know why it did not work before, perhaps bitbucket had issues at just that time.
So it seems very likely that one of the versions after 0.15.0 introduced something that causes this behavior. |
I have updated to the latest flux v0.26.0 and image-automation-controller v0.20.0. The issue seems to be fixed, but instead I am getting this error in the logs and as an alert in Slack:
Everything seems to work, though; it is just generating some spam |
Any signs it has to e.g. reconcile twice before it succeeds? |
How do I detect that? Same reconciliation notification twice in slack e.g.? I don't see anything like that, but just noticed this error does not happen for each of the reconciliations, some go through just fine |
Try increasing the timeout in your GitRepository. I guess that error hides a timeout; by default it's set to 20s, and that may be too low for your setup. |
Thanks, I will try it this way:
Do you know whether an interval equal to the timeout can cause any trouble, or does it look fine this way? |
I would set the timeout to |
In theory, having a timeout higher than the interval does not really matter, because the interval is used to schedule the next reconciliation once the current reconcile of the resource completes. This means that, for example, the following would just result in a reconciliation every 6 minutes in the worst-case scenario:
spec:
  interval: 1m
  timeout: 5m |
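For context, here is a minimal sketch of where those two fields sit in a full GitRepository manifest. The name, namespace, URL, branch and secret name are hypothetical placeholders, and the `apiVersion` is an assumption that depends on the installed source-controller version. Only `interval` and `timeout` come from the discussion above.

```yaml
# Hypothetical GitRepository: metadata, url, ref and secretRef are placeholders.
# The apiVersion is an assumption; check which version your source-controller serves.
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: my-repo
  namespace: flux-system
spec:
  # How long to wait before scheduling the next reconciliation.
  interval: 1m
  # Upper bound for Git operations such as clone; raised from the 20s default mentioned above.
  timeout: 5m
  url: ssh://git@bitbucket.org/example/my-repo.git
  ref:
    branch: main
  secretRef:
    name: my-repo-ssh
```

With these values, a reconcile that runs into the full 5m timeout followed by the 1m interval gives the 6-minute worst case described above.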
Thanks for the help! I'll run it for some time with this setup and see what happens |
Looks good so far! |
I think we need to change the default GitRepository timeout in source-controller, |
We've been running Flux v0.24.1 / image-automation-controller v0.18.0 for a little while now and just started to see this as well. I did notice this is intermittent, but it did put my ImageUpdateAutomation object into a non-ready state and never recovered:
After killing the pod, it restarted and went back into a ready state after one initial failure. I had updated the interval and increased the timeout on the
|
You can try updating to 0.26.0, that helped a lot (at least in my case) + the timeout |
This issue had been resolved for us up through Flux v0.27.2, but after updating to Flux v0.27.3 / image-automation-controller v0.20.1, we're seeing this issue intermittently once again. Could be related to the fix for #316? |
I have updated flux from version 0.18 to the latest 0.25.2, which included an update of image-automation-controller from 0.15 to 0.19.
After that there was one successful commit to the repository. After that commit everything stopped and killing the pod does not help. It is still the same message all the time. I have turned on debug level but it did not provide anything more useful than this error:
I have tried to downgrade to 0.15 and then a weird thing happened: all the commits went through, but right after that a similar error appeared:
Source controller does not have this trouble. I am using Bitbucket Cloud. I have seen a similar error posted elsewhere, but that one seems to be resolved by restarting the pod each time, while I am not so lucky, so it's probably best to track this as a separate issue. Is there any workaround I can use? Automation has all but stopped for us because of it.