copying system image from manifest list: writing/storing blob: happened during read: unexpected EOF #17193

Open
edsantiago opened this issue Jan 23, 2023 · 16 comments
Labels: flakes Flakes from Continuous Integration

@edsantiago
Member

New flake, two instances, both in the same PR at a similar clock time:

# podman-remote --url unix:/tmp/podman_tmp_0PCR pull quay.io/libpod/testimage:multiimage
Trying to pull quay.io/libpod/testimage:multiimage...
Getting image source signatures
Copying blob sha256:9afcdfe780b4ea44cc52d22e3f93ccf212388a90370773571ce034a62e14174e
Error: copying system image from manifest list: writing blob: storing blob to file "/var/tmp/storage4172392440/1": happened during read: unexpected EOF
[ rc=125 (** EXPECTED 0 **) ]

and

 podman [options] pull quay.io/libpod/systemd-image:20230106
         Trying to pull quay.io/libpod/systemd-image:20230106...
         Getting image source signatures
         Copying blob sha256:aa55548baa7f0ff25533b19213c56de832034f22989b884d6ca968c6151dd572
         Error: copying system image from manifest list: writing blob: storing blob to file "/var/tmp/storage3799877059/1": happened during read: unexpected EOF

(I'm not tagging this as remote because, even though the integration-test failure says remote, the failure actually happened while pulling cache images, which I think is done with non-remote podman.)

edsantiago added the flakes (Flakes from Continuous Integration) label Jan 23, 2023
@flouthoc
Collaborator

Saw multiple occurrences of this flake while working on #16297 as well.

@vrothberg
Member

vrothberg commented Jan 24, 2023

Looks like a registry flake to me. @mtrmac, I think we need to tweak how EOFs are caught during retries. I did not investigate any further than guessing that https://github.com/containers/common/blob/main/pkg/retry/retry.go#L76-L79 is not enough.
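
A minimal sketch of the kind of check involved (standard library only; the identifiers are invented for illustration and this is not the actual c/common/pkg/retry code): walk the wrapped error chain and decide whether the failure looks like a transient read problem worth retrying.

    package main

    import (
        "errors"
        "fmt"
        "io"
        "net"
        "syscall"
    )

    // isRetryableReadError is a hypothetical predicate in the spirit of the
    // retry.go lines linked above: walk the wrapped error chain and report
    // whether the failure looks like a transient read problem worth retrying.
    func isRetryableReadError(err error) bool {
        // Short read: the registry closed the connection mid-body.
        if errors.Is(err, io.ErrUnexpectedEOF) || errors.Is(err, io.EOF) {
            return true
        }
        // Connection reset by peer.
        if errors.Is(err, syscall.ECONNRESET) {
            return true
        }
        // Network-level timeouts.
        var netErr net.Error
        if errors.As(err, &netErr) && netErr.Timeout() {
            return true
        }
        return false
    }

    func main() {
        // The error from the first report above, wrapped the usual Go way.
        err := fmt.Errorf(`storing blob to file "/var/tmp/storage4172392440/1": happened during read: %w`,
            io.ErrUnexpectedEOF)
        fmt.Println(isRetryableReadError(err)) // true
    }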

@mtrmac
Collaborator

mtrmac commented Jan 24, 2023

Yes; it’s already on a short-term list as https://issues.redhat.com/browse/RUN-1636.

@mtrmac
Collaborator

mtrmac commented Jan 25, 2023

containers/image#1816 is WIP trying to improve this.

Note #17221, filed early to collect Podman production data. If that doesn’t catch the flake (BODYREADER DEBUG in logs), it would be useful to keep retrying until it does.

@edsantiago
Member Author

Just the last four days:

@edsantiago
Member Author

Last five days

@vrothberg
Member

Help is on the way. Let me vendor c/image directly now since @mtrmac's latest changes are expected to fix it.

vrothberg added a commit to vrothberg/libpod that referenced this issue Feb 7, 2023
To fix the registry flakes we see in containers#17193, make c/image more
tolerant toward unexpected EOFs from registries.

Also pin docker/docker to v20.10.23+incompatible, since the newer version
doesn't compile; imagebuilder needs to be fixed first.

Fixes: containers#17193
Signed-off-by: Valentin Rothberg <[email protected]>
@mtrmac
Collaborator

mtrmac commented Feb 9, 2023

#17221 was merged; that should help but we do see failures that don’t fit into the current heuristic.

@edsantiago
Member Author

Same failure, but with a new error message (seen in f36 rootless). Is this helpful?

Error: copying system image from manifest list: writing blob: 
      storing blob to file "/var/tmp/storage2897374505/1": happened during read:
       (heuristic tuning data: last retry 0, current offset 752828; 77.493 ms total, 0.629 ms since progress): unexpected EOF

@mtrmac
Collaborator

mtrmac commented Feb 13, 2023

I don’t know of a good way to collect those “heuristic tuning data” messages… in principle they could show us something unexpected. (Real telemetry is… a thought.)

So far all of those have been something like the above: a failure very soon after starting to fetch data, less than a megabyte into the stream (a megabyte being the current cut-off). That’s probably a clear enough indication that we need to allow a retry in that case, so more instances of that kind of failure (last retry 0, small total, small “since progress”) would not make a difference. Other kinds of failure messages would still be interesting.
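
To make the “heuristic tuning data” fields concrete, here is a rough sketch of a decision rule along the lines described above: allow another body-read retry when the connection drops early in the stream, below a cut-off of roughly one megabyte. The type and field names are invented for illustration and are not the actual c/image bodyReader.

    package main

    import (
        "fmt"
        "time"
    )

    // readProgress mirrors the fields reported in the error message; the names
    // are invented for this sketch and are not the c/image API.
    type readProgress struct {
        retries       int           // "last retry"
        offset        int64         // "current offset", bytes read so far
        total         time.Duration // total time spent reading this blob
        sinceProgress time.Duration // time since the last byte arrived
    }

    // earlyFailureCutoff is the ~1 MiB cut-off mentioned above.
    const earlyFailureCutoff = 1 << 20

    // shouldRetryRead captures the rule discussed in this comment: a failure
    // very soon after starting to fetch data is worth another attempt.
    func shouldRetryRead(p readProgress) bool {
        return p.retries == 0 && p.offset < earlyFailureCutoff
    }

    func main() {
        // Values from the f36 rootless report above.
        p := readProgress{
            retries:       0,
            offset:        752828,
            total:         77493 * time.Microsecond,
            sinceProgress: 629 * time.Microsecond,
        }
        fmt.Println(shouldRetryRead(p)) // true: well under the 1 MiB cut-off
    }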

@mtrmac
Collaborator

mtrmac commented Feb 13, 2023

That’s probably a clear enough indication that we need to allow a retry in that case

containers/image#1847 / #17492

@edsantiago
Member Author

Here's one I don't know how to categorize: a TLS handshake failure. It's cdn03.quay, but not a DNS lookup error, so not #16973. If it doesn't belong here, let me know and I'll shove it into #17288 ("Other").

not ok 538 pull image into a full storage
...
$ buildah --root=/tmp/buildah-test pull --signature-policy /usr/share/buildah/test/system/./policy.json alpine
Resolved "alpine" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/alpine:latest...
parsing image configuration: Get "https://cdn03.quay.io/sha256/96/961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4
     ?X-Amz-Algorithm=AWS4-HMAC-SHA256
     &X-Amz-Credential=AKIAI5LUAQGPZRPNKSJA%2F20230206%2Fus-east-1%2Fs3%2Faws4_request
     &X-Amz-Date=20230206T124402Z
     &X-Amz-Expires=600
     &X-Amz-SignedHeaders=host
     &X-Amz-Signature=ce946b86fe8265c8db2dc7695e0758d104a584b385f122433a02cd670bdc1fb6
     &cf_sign=cqeGovWnO5c8L7E3AnZxOTbW%2FTX01Zl%2BiePEnxn7BA7DseHg8wde58MdN3joVi43OlROpIN6RjWGAzn3t0PxI1GN6iTdINMSK7Ngh12EDjvgGLqjr%2FTyDZQ%2BU1Ppb%2BvRuvh%2BwbysezFTFasli%2BUInIVOXFFa9uGUb%2BdAv4FTp6Q3UI8EyHNzycTbcvUgESsX9MOAhu4gUyd3vWqpdNQWRzP1XPBENSfZyHjezr5MMH3A2DzdnTVgu0v8U7iQD06KAdL7ZBH06PXr0ALyOg9Xs7UqAzUpHctyTS9oyuHGbz9nPw3fPGY29Mn1r7PUNE6%2F6wIGNlfJQVsiYLTur73QjQ%3D%3D
     &cf_expiry=1675688042
     &region=us-east-1": remote error: tls: handshake failure

@mtrmac
Collaborator

mtrmac commented Feb 14, 2023

From a not very careful check, I’d categorize that as “other”: it is the remote server successfully sending a “handshake failure” TLS alert record, i.e. it’s not an EOF we see from the server. (It may be possible that the connection was disrupted in fundamentally the same way, causing the server to see an EOF from us and to respond with a “handshake failure” alert. I’m not sure.)

@edsantiago
Member Author

I'm seeing a lot of flakes now that say "502 Bad Gateway" instead of "happened during read":

Error: reading blob sha256:9d16cba9fb961d1aafec9542f2bf7cb64acfc55245f9e4eb5abecd4cdc38d749: fetching blob: received unexpected HTTP status: 502 Bad Gateway

Is this a new variant of this same issue? Is it #16973 (quay flakes)? Or should I file a new issue?

@mtrmac
Collaborator

mtrmac commented Mar 15, 2023

  • “happened during read” means we successfully connected to the registry, and got status 200; the failure happened while reading the response body.
  • “happened during read … unexpected EOF” then means the body ended prematurely (IIRC violating the Transfer-Encoding: Chunked framing, in particular)
  • This 502 Bad Gateway means the registry responded with this code, instead of status 200.

So, 502 is a different behavior. (We don’t see inside the registry to understand whether it’s the same cause, and it might not matter.)

For context, currently there are two levels of retries:

  • A c/common/pkg/retry retry. AFAICS that should, by default, retry each pull (as a whole) three times. There should be a warning-level message “Failed, retrying in %s ... (%d/%d). Error: %v”.
  • A c/image retry. This applies only to the “happened during read” failures, and it is restricted specifically to “unexpected EOF” and “connection reset”. (The idea is that the top-level c/common/pkg/retry drives retries at the top level, and lower-level libraries should not retry so that we don’t multiply the number of attempts; the “happened during read” failures are an exception because if we retry at the higher level, we need to redownload the whole layer, while the c/image retry can possibly continue where it left off.)

Looking at the code, it seems that a 502 doesn’t trigger the c/image retry (as it shouldn’t), but it also doesn’t trigger c/common/pkg/retry (because the error returned by c/image doesn’t have a recognizable Go type).

So, if the failures don’t trigger a “Failed, retrying in …” warning, please file that separately; we should export the error type in c/image, and handle it in c/common.

(This is under the assumption that retrying could help the operation succeed. I’m not sure about that, nor how to reasonably measure that. Retrying will very likely increase the load on the server, possibly making the failure harder to recover from.)
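
A rough sketch of the direction suggested above; the type name and fields are hypothetical, not the actual c/image API. The idea is that c/image would return an exported error type carrying the HTTP status, and the top-level retry logic could match it with errors.As and decide whether that status is worth retrying.

    package main

    import (
        "errors"
        "fmt"
        "net/http"
    )

    // UnexpectedHTTPStatusError is a hypothetical exported error type; the real
    // c/image type and fields may differ. It records the status the registry
    // actually returned.
    type UnexpectedHTTPStatusError struct {
        StatusCode int
        Status     string
    }

    func (e *UnexpectedHTTPStatusError) Error() string {
        return fmt.Sprintf("received unexpected HTTP status: %s", e.Status)
    }

    // shouldRetryStatus is the kind of check a top-level retry helper could make
    // once the error type is recognizable: retry gateway-style failures, give up
    // on everything else.
    func shouldRetryStatus(err error) bool {
        var statusErr *UnexpectedHTTPStatusError
        if !errors.As(err, &statusErr) {
            return false
        }
        switch statusErr.StatusCode {
        case http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
            return true
        }
        return false
    }

    func main() {
        // The shape of the 502 error reported above.
        err := fmt.Errorf("fetching blob: %w",
            &UnexpectedHTTPStatusError{StatusCode: 502, Status: "502 Bad Gateway"})
        fmt.Println(shouldRetryStatus(err)) // true
    }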
