Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"unable to clone: SSH could not read data: Error waiting on socket" after updating to v0.19 #293

Closed
ilya-git opened this issue Jan 17, 2022 · 19 comments · Fixed by fluxcd/source-controller#570

Comments

@ilya-git
Copy link

ilya-git commented Jan 17, 2022

I have updated flux from version 0.18 to the latest 0.25.2 which included update of image-automation-controller from 0.15 to 0.19.

After that there was one successful commit to the repository. After that commit everything stopped and killing the pod does not help. It is still the same message all the time. I have turned on debug level but it did not provide anything more useful than this error:

{
  "level": "error",
  "ts": "2022-01-17T14:09:25.041Z",
  "logger": "controller.imageupdateautomation",
  "msg": "Reconciler error",
  "reconciler group": "image.toolkit.fluxcd.io",
  "reconciler kind": "ImageUpdateAutomation",
  "name": "my-gitops",
  "namespace": "flux-system",
  "error": "unable to clone: SSH could not read data: Error waiting on socket",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"
}

I have tried to downgrade to 0.15 and then a weird thing happened: all the commits went through, but right after that a similar error appeared:

{
  "level": "error",
  "ts": "2022-01-17T14:13:17.047Z",
  "logger": "controller-runtime.manager.controller.imageupdateautomation",
  "msg": "Reconciler error",
  "reconciler group": "image.toolkit.fluxcd.io",
  "reconciler kind": "ImageUpdateAutomation",
  "name": "my-gitops",
  "namespace": "flux-system",
  "error": "unable to clone 'ssh://[email protected]/myorg/my-gitops', error: SSH could not read data: Error waiting on socket"
}

Source controller does not have this trouble. I am using Bitbucket Cloud. I have seen a similar error posted elsewhere, but that one seems to be resolved by restarting the pod each time while I am not so lucky, so it's probably best to have it as a separate error.

Is there any workaround I can use, automation all but stopped for us because of it.

@hiddeco
Copy link
Member

hiddeco commented Jan 17, 2022

Can you trace the application while it is in a "locked state", as I described here: #209 (comment)?

In case of concerns about sensitive data, you can send it to hidde <at> weave <dot> works.

@ilya-git
Copy link
Author

ilya-git commented Jan 17, 2022

I can send it to you, but could you please describe what type of sensitive data is there? I hope there is no dump of k8s secrets or something in there :)
I tried to check myself but it seems to be binary

@hiddeco
Copy link
Member

hiddeco commented Jan 17, 2022

The execution trace captures a wide range of execution events such as goroutine creation/blocking/unblocking, syscall enter/exit/block, GC-related events, changes of heap size, processor start/stop, etc. A precise nanosecond-precision timestamp and a stack trace is captured for most events. The generated trace can be interpreted using go tool trace.
https://pkg.go.dev/runtime/trace#hdr-Tracing_runtime_activities

It tells us what the application is doing at the moment the snapshot is taken. Nothing sensitive directly, and you can check the contents yourself using go tool trace <file>.out. I mostly include this message because people are often hesitant about sharing it in public.

Looking at a 10s snapshot I took myself however, it may be more useful to have a 30 or 60 second version (by changing the GET parameter value in the URL).

@ilya-git
Copy link
Author

Thanks for details, I have sent you the traces with 60 seconds. Do you by any chance have an idea what can we do in a meantime (any workaround)?

@ilya-git
Copy link
Author

ilya-git commented Jan 18, 2022

I have started to see this error from time to time as well (don't know if it is relevant or not to this case, but for the sake of completeness will post it here):

{
  "level": "error",
  "ts": "2022-01-18T09:18:14.258Z",
  "logger": "controller.imageupdateautomation",
  "msg": "Reconciler error",
  "reconciler group": "image.toolkit.fluxcd.io",
  "reconciler kind": "ImageUpdateAutomation",
  "name": "my-gitops",
  "namespace": "flux-system",
  "error": "unable to clone: User"
}

@ilya-git
Copy link
Author

I have now downgraded to 0.15.0 and it seems to work fine again! I don't know why it did not work before, perhaps bitbucket had issues at just that time.

image: ghcr.io/fluxcd/image-automation-controller:v0.15.0

So it seems very likely that one of the versions after 0.15.0 introduced something that causes this behavior.

@ilya-git
Copy link
Author

ilya-git commented Feb 3, 2022

I have updated to the latest flux v0.26.0 and image-automation-controller v0.20.0. The issue seems to fixed, but instead I am getting this error in the logs and as an alert in slack:

unable to clone: User

Everything seems to work though, just generating some spam

@hiddeco
Copy link
Member

hiddeco commented Feb 3, 2022

Any signs it has to e.g. reconcile twice before it succeeds?

@ilya-git
Copy link
Author

ilya-git commented Feb 3, 2022

How do I detect that? Same reconciliation notification twice in slack e.g.? I don't see anything like that, but just noticed this error does not happen for each of the reconciliations, some go through just fine

@stefanprodan
Copy link
Member

Try increasing the timeout in your GitRepository, I guess that error hides a timeout, by default it's set to 20s and that may be too low for your setup.

@ilya-git
Copy link
Author

ilya-git commented Feb 3, 2022

Thanks, I will try it this way:

spec:
  interval: 1m
  timeout: 1m

Do you know if interval that is equal to timeout can cause any troubles or it looks fine this way?

@stefanprodan
Copy link
Member

I would set the timeout to 50s, always less than the interval.

@hiddeco
Copy link
Member

hiddeco commented Feb 3, 2022

Having a timeout higher than the interval does theoretically not really matter because it is used to schedule the next reconciliation while completing the reconcile of the resource, which means that e.g. the following would just result in a reconciliation every 6 minutes in the worst case scenario:

spec:
  interval: 1m
  timeout: 5m

@ilya-git
Copy link
Author

ilya-git commented Feb 3, 2022

Thanks for help! I'll run it for some time with this set up and see what happens

@ilya-git
Copy link
Author

ilya-git commented Feb 4, 2022

Looks good so far!

@stefanprodan
Copy link
Member

I think we need to change the default GitRepository timeout in source-controller, 20s it's way to low for libgit2.

@mbrancato
Copy link

mbrancato commented Feb 5, 2022

We've been running Flux v0.24.1 / image-automation-controller v0.18.0 for a little while now and just started to see this as well.

I did notice this is intermittent, but it did put my ImageUpdateAutomation object into a non-ready state and never recovered:

NAME                             	READY	MESSAGE                                                          	LAST RUN                 	SUSPENDED 
imageupdateautomation/mygit		False	unable to clone: SSH could not read data: Error waiting on socket	2022-02-03T13:43:54-05:00	False    

After killing the pod, it restarted and went back into a ready state after one initial failure. I had updated the interval and increased the timeout on the GitRepository but that didn't seem to help given the initial failure on startup. And after restarting, it pushed an image update to a git branch which I was expecting.

NAME                             	READY	MESSAGE                                                     	LAST RUN                 	SUSPENDED 
imageupdateautomation/mygit		True 	no updates made; last commit aa63386 at 2022-02-05T03:25:49Z	2022-02-04T22:25:52-05:00	False    

@ilya-git
Copy link
Author

ilya-git commented Feb 5, 2022

You can try updating to 0.26.0, that helped a lot (at least in my case) + the timeout

@gws
Copy link

gws commented Mar 2, 2022

This issue had been resolved for us up through Flux v0.27.2, but after updating to Flux v0.27.3 / image-automation-controller v0.20.1, we're seeing this issue intermittently once again. Could be related to the fix for #316?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants