
ghcr.io: push times out unreproducible #251

Closed
junghans opened this issue Dec 11, 2020 · 58 comments
Comments

@junghans

From https://github.com/votca/buildenv/runs/1534243887?check_suite_focus=true, I am getting a lot of unreproducible errors when pushing a container:

#15 pushing layers 24.1s done
#15 pushing manifest for registry.gitlab.com/votca/buildenv/ubuntu:18.04
#15 pushing manifest for registry.gitlab.com/votca/buildenv/ubuntu:18.04 1.5s done
#15 pushing layers 631.9s done
#15 ERROR: failed commit on ref "layer-sha256:49507e21bb5d0cee4bce2fee3248063d1eff846974de5da2d70f7edc07af9181": no response
------
 > exporting to image:
------
failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:49507e21bb5d0cee4bce2fee3248063d1eff846974de5da2d70f7edc07af9181": no response
Error: buildx call failed with: failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:49507e21bb5d0cee4bce2fee3248063d1eff846974de5da2d70f7edc07af9181": no response

Never happened with v1.

@crazy-max
Member

@junghans Can you change the following step and let me know? Thanks:

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
        with:
          driver-opts: image=moby/buildkit:buildx-stable-1
          buildkitd-flags: --debug

@junghans
Author

2020-12-11T16:24:18.7837214Z #15 ERROR: failed commit on ref "layer-sha256:dbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c": failed to do request: Put https://ghcr.io/v2/votca/buildenv/ubuntu/blobs/upload/ac1e0dc5-5197-4415-a443-fb7411ef858a?digest=sha256%3Adbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c: read tcp 172.17.0.2:49118->140.82.113.33:443: read: connection timed out
2020-12-11T16:24:18.7841220Z ------
2020-12-11T16:24:18.7842055Z  > exporting to image:
2020-12-11T16:24:18.7843039Z ------
2020-12-11T16:24:18.7847131Z failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:dbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c": failed to do request: Put https://ghcr.io/v2/votca/buildenv/ubuntu/blobs/upload/ac1e0dc5-5197-4415-a443-fb7411ef858a?digest=sha256%3Adbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c: read tcp 172.17.0.2:49118->140.82.113.33:443: read: connection timed out
2020-12-11T16:24:18.9082160Z ##[error]buildx call failed with: failed to solve: rpc error: code = Unknown desc = failed commit on ref "layer-sha256:dbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c": failed to do request: Put https://ghcr.io/v2/votca/buildenv/ubuntu/blobs/upload/ac1e0dc5-5197-4415-a443-fb7411ef858a?digest=sha256%3Adbd68bb68d01ef8538e311584c0da7eebdcbab63cefa502035c8e52fb7eae24c: read tcp 172.17.0.2:49118->140.82.113.33:443: read: connection timed out

From https://github.com/votca/buildenv/runs/1538301715?check_suite_focus=true

@crazy-max
Member

@junghans Looks like an issue with GHCR (cc @clarkbw).

@clarkbw

clarkbw commented Dec 11, 2020

We'll take a look but someone might not get to it until Monday.


@markphelps

markphelps commented Dec 14, 2020

@junghans I took a look on our end in the GHCR logs and it looks like your layer was uploaded successfully.

GHCR received the request at 2020-12-11T16:13:52Z and it looks like it was completed by us at 2020-12-11T16:14:27Z

@crazy-max Is there some kind of 30s timeout in the request that buildx uses when pushing a layer? The presence of no response and read: connection timed out leads me to believe this is a client-side timeout?

@crazy-max
Member

crazy-max commented Dec 15, 2020

@markphelps Yes there is a default timeout of 30s but this one looks like a networking issue.

@junghans Can you try with the latest stable release of buildx please? (v0.5.1 atm):

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
        with:
          version: latest
          buildkitd-flags: --debug

@junghans
Author

Different image, but still fails:

2020-12-15T13:52:52.6570268Z #16 [auth] votca/buildenv/fedora:pull,push token for ghcr.io
2020-12-15T13:52:52.6571975Z #16 sha256:e8541c1073feec4e25a6fc2c2959aa1538edc7e6d2dfb710d10730ded1df9309
2020-12-15T13:52:52.6573387Z #16 DONE 0.0s
2020-12-15T13:52:52.8064217Z 
2020-12-15T13:52:52.8066189Z #14 exporting to image
2020-12-15T13:52:52.8067223Z #14 sha256:e8c613e07b0b7ff33893b694f7759a10d42e180f2b4dc349fb57dc6b71dcab00
2020-12-15T14:03:30.9576987Z #14 pushing layers 638.5s done
2020-12-15T14:03:30.9581863Z #14 ERROR: failed commit on ref "layer-sha256:7d57542efffe8baa32825f71430bd49f142a0ac4e6fdc42c77b0e654f0381ac8": failed to do request: Put https://ghcr.io/v2/votca/buildenv/fedora/blobs/upload/e0620cbc-8822-4325-bac6-5d00317c0d1b?digest=sha256%3A7d57542efffe8baa32825f71430bd49f142a0ac4e6fdc42c77b0e654f0381ac8: read tcp 172.17.0.2:44330->140.82.112.33:443: read: connection timed out

(https://github.com/votca/buildenv/runs/1557317848?check_suite_focus=true)

@crazy-max
Member

@junghans OK, I think I got it: that's because you can only have a single namespace with GHCR. So ghcr.io/votca/buildenv/fedora:latest will not work here. Try instead with ghcr.io/votca/buildenv-fedora:latest. See #115 (comment)

@junghans
Author

Huh? It works with build-push-action@v1 and the image shows up here: https://github.com/orgs/votca/packages/container/package/buildenv%2Ffedora

@clarkbw

clarkbw commented Dec 15, 2020

So ghcr.io/votca/buildenv/fedora:latest will not work here. Try instead with ghcr.io/votca/buildenv-fedora:latest

Odd, this should work with GHCR. We don't have a limit on namespaces.

@crazy-max
Member

Odd, this should work with GHCR. We don't have a limit on namespaces.

My bad, there is actually no limit, yes!

@junghans
Author

Any news on this?

@crazy-max
Member

@junghans I've run some tests with several options related to buildx and containerd, and the following combination works for me (version: latest is buildx v0.5.1 atm):

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
        with:
          version: latest
          buildkitd-flags: --debug

[screenshot of the resulting successful workflow run]

@junghans
Author

Still failing: https://github.com/votca/buildenv/actions/runs/430519742. Is there anything else you changed?

@crazy-max
Member

@junghans Nothing special that could change the current behavior actually 🤔

@junghans
Author

Could it be a race condition, as all jobs run simultaneously?

@crazy-max
Member

@junghans I don't think so but can you set max-parallel: 4 in your job's strategy?
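
For illustration, a minimal sketch of such a strategy block (the matrix values are placeholders, not taken from the votca/buildenv workflow):

    strategy:
      # cap how many matrix jobs push to ghcr.io at the same time
      max-parallel: 4
      matrix:
        distro: [ubuntu, fedora, opensuse]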

@junghans
Author

junghans commented Dec 18, 2020

Actually that worked!?

@crazy-max
Member

crazy-max commented Dec 18, 2020

@junghans That's strange, maybe abuse prevention on bandwidth usage on GitHub Container Registry. Your images are quite huge (fedora:intel ~10GB). WDYT @markphelps?

@junghans
Author

Yeah, the intel compiler is just huge.....

@junghans
Author

Now it fails even when running sequentially: https://github.com/votca/buildenv/runs/1609439618?check_suite_focus=true

@koppor

koppor commented Jan 4, 2021

Same issue here with a 2.2GB image. I might be mistaken, but isn't rsync a possible technical solution for the transfer? In a bash world, I would use rsync -avz --progress --partial in a loop (see also https://serverfault.com/a/98750/107832). There are also BSD-licensed implementations available: https://github.com/kristapsdz/openrsync. I think this could be a way to transfer large files reliably without wasting bandwidth on retries.
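
A rough sketch of that idea as a workflow step, purely illustrative: ghcr.io does not accept rsync, so this would only apply to a plain file host you control, and the destination and paths here are made up:

      - name: Sync large image tarball with retries
        run: |
          # keep retrying; --partial resumes interrupted transfers instead of restarting them
          until rsync -avz --progress --partial ./image.tar user@example.org:/srv/images/; do
            echo "rsync failed, retrying in 10s..."
            sleep 10
          done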

@markphelps

@junghans That's strange, maybe abuse prevention on bandwidth usage on GitHub Container Registry. Your images are quite huge (fedora:intel ~10GB). WDYT @markphelps?

There shouldn't be an issue with abuse prevention or bandwidth usage, but I will look into this more, and into @koppor's issue as well, tomorrow (just got back from vacation today).

@koppor do you have an example of your push to ghcr failing that I can see?

@tomy0000000

I'm building many images with a matrix, and having a similar issue when pushing the exported image to ghcr.io, even with max-parallel.

My action runs automatically upon commit, and I'll manually tag the commit to kick off another workflow for pushing the image. Therefore, I'm certain this has got to be an issue with the ghcr.io server.

Notice that the following three commits run on an identical Dockerfile, and the timeout issue isn't always reproducible.

                       wo/ Push   w/ Push
max-parallel: 2        Action     Action
max-parallel: 4        Action     Action
without max-parallel   Action     Action

tomy0000000 added a commit to tomy0000000/images that referenced this issue Mar 2, 2021
@markphelps

Sorry for the radio silence. We've made some changes in ghcr.io that should hopefully help alleviate these concurrency issues. Would you mind letting me know if your parallel builds succeed?

gerhard added a commit to rabbitmq/rabbitmq-server that referenced this issue Mar 30, 2021
@hiaselhans

hi @markphelps

ran into the same problem today uploading a rather big but not huuge (~2GB) image to ghcr:
https://github.com/airgproducts/llvm/runs/2335603402?check_suite_focus=true

crazy-max changed the title from "v2: push times out unreproducible" to "ghcr.io: push times out unreproducible" on Apr 18, 2021
@jessfraz

As a stop gap, is there any way to have a retry-build-on-error functionality?

@crazy-max
Member

Hi @jessfraz, is it about this workflow?

@tonistiigi
Member

As a stop gap, is there any way to have a retry-build-on-error functionality?

There is a push retry in the latest buildkit: https://github.com/moby/buildkit/blob/master/util/resolver/retryhandler/retry.go#L51

The original "no response" error reported was fixed in containerd/containerd#4724

@jessfraz

Ah sorry, it's a private repo. Also, looking into this more, it is not "no response".

Basically 1 in 50 errors with either 403 Forbidden or a 503 service error; happy to make a different issue. Maybe a race or something?

[screenshots of three failed workflow runs showing the 403/503 errors]

@tonistiigi
Member

The 503 retry handler was added recently. It is in master but not in the latest release yet.

403 is weird though and could be a misbehaving registry. Problems with auth should return 401.

@crazy-max
Member

@tonistiigi

403 is weird though and could be a misbehaving registry. Problems with auth should return 401.

Yes, I agree with that.

@jessfraz Do you use the GITHUB_TOKEN or a PAT to log in to ghcr?
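
For reference, a typical GITHUB_TOKEN-based login step looks roughly like this (a sketch of the common pattern, not jessfraz's actual workflow):

      - name: Login to GitHub Container Registry
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}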

@markphelps

markphelps commented Apr 20, 2021

hi @markphelps

ran into the same problem today uploading a rather big but not huuge (~2GB) image to ghcr:
https://github.com/airgproducts/llvm/runs/2335603402?check_suite_focus=true

@hiaselhans

Upon looking for your request that errored out in your linked workflow run, I see this error:

 #26 ERROR: failed commit on ref "layer-sha256:db2b5fcf76f8fb16af3c060cac85adef71b061210be3618e2738f388f6d85953": failed to do request: Put https://ghcr.io/v2/airgproducts/llvm/blobs/upload/514e25f9-7efa-421b-9a2a-acbabde7bbdd?digest=sha256%3Adb2b5fcf76f8fb16af3c060cac85adef71b061210be3618e2738f388f6d85953: read tcp 172.17.0.2:48072->140.82.114.33:443: read: connection timed out
------

This still makes me think this is a client-side (buildx) timeout issue. I looked in our load balancer logs for this request to upload this layer, and see that it (our server) did in fact respond with a 201 Created status:

query":"?digest=sha256%3Adb2b5fcf76f8fb16af3c060cac85adef71b061210be3618e2738f388f6d85953","recv_timestamp":"2021-04-13T19:08:43+0000","status":201,"ta":19130,"tc":0,"td":0,"term_state":"----","th":0,"ti":0,"tq":0,"tr":19130,"trb":0,"tt":19130,"tw":0,"type":"syslog","ubytes":223072382,"uri":"/v2/airgproducts/llvm/blobs/upload/514e25f9-7efa-421b-9a2a-acbabde7bbdd?digest=sha256%3Adb2b5fcf76f8fb16af3c060cac85adef71b061210be3618e2738f388f6d85953","user":"-","user_agent":"containerd/1.4.0+unknown"}

Note the fields tt: 19130 and tr: 19130, which are the total time in milliseconds elapsed between the accept and the last close, and the total time in milliseconds spent waiting for the server to send a full HTTP response (not counting data), respectively.

So it looks like the server (GHCR) spent 19 seconds before returning the 201 response for that layer upload. This leads me to think there is an issue client-side, timing out if it does not receive a response within a certain period of time, or perhaps a networking issue (on which side I do not know yet).

@jessfraz

jessfraz commented Apr 20, 2021

Do you use the GITHUB_TOKEN or a PAT to login to ghcr?

@crazy-max I am using a PAT but can switch to GITHUB_TOKEN since I turned that on as well!

@tonistiigi
Member

So it looks like the server (GHCR) spent 19 seconds before returning the 201 response for that layer upload

BuildKit has a 30 sec timeout for TCP and 10 for TLS (https://github.com/moby/buildkit/blob/master/util/resolver/resolver.go#L190-L195), which matches the Go recommendation (https://github.com/golang/go/blob/52bf14e0e8bdcd73f1ddfb0c4a1d0200097d3ba2/src/net/http/transport.go#L42-L53). I guess we could increase it a bit if we think it is related, but some timeout is needed so the build doesn't just hang, and so errors are returned when they are valid.

@markphelps

I would have thought the TLS handshake would have already occurred, though; which timeout do you think it would be? It's possible we also may not be sending an expected header value for keep-alive?

@tonistiigi
Member

@markphelps It's not the TLS timeout, as that error is different: https://github.com/golang/go/blob/master/src/net/http/transport.go#L2841

@markphelps

@jessfraz we have been experiencing some network issues with GHCR over the past couple of days, but we think we have resolved them. Would you please let me know if you are still experiencing intermittent 403/503s?

@lucacome

I think it's related (if not, I can open another issue). For the last few builds I've been seeing a connection reset by peer, and it seems like it just doesn't want to work no matter how many times I restart it:

#21 exporting to image
#21 10.19 error: failed to copy: failed to do request: Put "ghcr.io/v2/***/nginx-ubi/blobs/upload/7a8c69f8-1810-461e-b06c-eff124697f60?digest=sha256%3A95f3a65224747afb7ca844ff6cc345d49ddc569c8f2be7566bbcece0cbdcb4b2": write tcp 172.17.0.2:34150->140.82.112.34:443: write: connection reset by peer
#21 10.19 retrying in 1s

@crazy-max
Member

@lucacome See docker/buildx#834 (comment), that might be related.

@shepmaster

for the last few builds I've been seeing a connection reset by peer

I've been experiencing that same problem for about a month now. I've even opened an issue with GitHub support about it. They changed something in an attempt to fix it, but it unfortunately hasn't worked yet.

@South-Paw

South-Paw commented Nov 13, 2021

Have also just started experiencing connection reset by peer and broken pipe errors on our builds. We hadn't had any issues up until today with GitHub Actions; the last successful builds were 7 days ago. (docker/build-push-action#498)

However, AFAIK, the containers I'm working with are not as large as these (I could be mistaken though).

Possibly related to: containerd/containerd#6242 ?
