
archive: use pigz|zstd if available #1964

Conversation

@giuseppe (Member) commented Jun 10, 2024

Use the command-line tools when available, as they are much faster than the Go libraries.

Especially for gzip, I've measured a 50% improvement.

use zstd if available

The performance improvement is not as clear as with pigz, but it is still a measurable difference.
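
In rough terms, the approach is to look the external tool up on PATH and, when it is present, pipe the compressed stream through it instead of the Go decompressor. A minimal sketch of that idea (not the code in this PR; the function name is illustrative, and propagation of the child process's exit status is omitted):

```go
// Minimal sketch of the idea, not the exact implementation in this PR:
// prefer the external pigz binary when it is on PATH, otherwise fall
// back to the pure-Go compress/gzip reader.
package archive

import (
	"compress/gzip"
	"io"
	"os/exec"
)

func gzipDecompressor(compressed io.Reader) (io.ReadCloser, error) {
	if path, err := exec.LookPath("pigz"); err == nil {
		cmd := exec.Command(path, "-d", "-c") // decompress to stdout
		cmd.Stdin = compressed
		out, err := cmd.StdoutPipe()
		if err != nil {
			return nil, err
		}
		if err := cmd.Start(); err != nil {
			return nil, err
		}
		// A real implementation must also Wait() for the process and
		// surface its exit status to the reader; see the filter.go
		// review threads below.
		return out, nil
	}
	return gzip.NewReader(compressed)
}
```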

openshift-ci bot (Contributor) commented Jun 10, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rhatdan (Member) commented Jun 10, 2024

LGTM
Very cool
Would need to put in a Recommends for Podman and Buildah


@mtrmac (Collaborator) left a comment

Taking the reported speedup on faith, sure, why not.

Tests are unhappy, I didn’t check the details of the failure.

pkg/archive/filter.go (outdated):
go func() {
	defer w.Close()
	f.runErr = cmd.Run()
	pool.Put(input)
Collaborator

I don’t think this function should be in the business of managing the pool; let the caller worry about that, probably using NewReadCloserWrapper like on all the other paths.

(Uh… is the existing code in the caller safe? Shouldn’t it trigger on close of the input to the decompression, rather than the output? It’s not obvious to me that they always happen in the expected order.

If it is not safe, that’s also a reason for the caller to worry about that … and at least a separate commit.)
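
For context, the caller-side pattern being suggested looks roughly like this (a hypothetical sketch: `wrapWithPoolReturn` and the `putBack` callback are illustrative names, while ioutils.NewReadCloserWrapper is the helper referenced above):

```go
// Hypothetical sketch of the caller-side pattern: whoever borrowed the
// buffer from the pool wraps the decompressed stream, so the buffer is
// handed back only when the consumer closes the stream.
package archive

import (
	"io"

	"github.com/containers/storage/pkg/ioutils"
)

func wrapWithPoolReturn(decompressed io.ReadCloser, putBack func()) io.ReadCloser {
	return ioutils.NewReadCloserWrapper(decompressed, func() error {
		err := decompressed.Close()
		putBack() // return the pooled buffer after the consumer is done
		return err
	})
}
```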

f := &filterReader{reader: r}
go func() {
	defer w.Close()
	f.runErr = cmd.Run()
Collaborator

  • This is a data race vs the consumer goroutine.
  • I don’t immediately see that this filterReader is necessary: isn’t it enough to do

    var err = errors.New("internal error: should never be seen")
    defer w.CloseWithError(err) // CloseWithError(nil) == Close()
    err = cmd.Run()

Member Author

under what conditions will err ever be errors.New("internal error: should never be seen")?

Collaborator

Panic before Run returns.

Collaborator

… and in this case that should be an uncaught panic and terminate the whole process, so it doesn’t matter very much what err is (I guess? I’m not sure whether processing a panic is concurrent with the pipe consumer goroutine running).

It might actually matter if there were a recover somewhere on the call stack.

For me, this is now just a pattern, to try very hard to ensure that a pipe is always closed, so that the consumer will never be left hanging without a visible reason.

Collaborator

… and it’s a belt-and-suspenders thing: this makes sure to avoid a hypothetical CloseWithError(nil) on the panic path — whether or not panic is relevant, the point is that the claim “the pipe is successfully closed only on success” can be easily verified.
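
Put together, the pattern under discussion looks roughly like this (a sketch rather than the merged code; note that the deferred call is wrapped in a closure so that err is read when the defer runs, not when it is registered):

```go
package archive

import (
	"errors"
	"io"
	"os/exec"
)

// runFilter starts cmd with its stdout connected to a pipe and returns the
// read end. The write end is always closed: with cmd.Run's error on failure,
// with the sentinel error if a panic prevents Run from returning, and with
// nil (a plain Close) only on success.
func runFilter(cmd *exec.Cmd) io.ReadCloser {
	r, w := io.Pipe()
	cmd.Stdout = w
	go func() {
		err := errors.New("internal error: should never be seen")
		defer func() { w.CloseWithError(err) }() // CloseWithError(nil) == Close()
		err = cmd.Run()
	}()
	return r
}
```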

pkg/archive/filter.go (outdated, resolved)
@cgwalters (Contributor)

One side effect of using pigz is definitely that for consumers this will result in unreproducible gzip output compared to the previous Go gzip. And pigz in particular almost always produces different output for not-small inputs. Not fatal, but something to be aware of, and I suspect there are some consumers that are partially relying on the gzip "semi-reproducible" status quo, where the build context isn't caching the previous state via diffids.

@giuseppe (Member Author)

> One side effect of using pigz is definitely that for consumers this will result in unreproducible gzip output compared to the previous Go gzip. And pigz in particular almost always produces different output for not-small inputs. Not fatal, but something to be aware of, and I suspect there are some consumers that are partially relying on the gzip "semi-reproducible" status quo, where the build context isn't caching the previous state via diffids.

Thanks for the information, I hadn't thought of that problem; it is something to consider if we plan to use pigz for the compressor as well. The current PR changes only the decompression side.

@giuseppe giuseppe force-pushed the use-decompressor-filters-if-available branch 4 times, most recently from 2fc8de5 to 78cf545 on June 11, 2024 at 09:07
@giuseppe giuseppe changed the title from "[RFC] archive: use pigz|zstd if available" to "archive: use pigz|zstd if available" on Jun 11, 2024
@giuseppe giuseppe force-pushed the use-decompressor-filters-if-available branch from 78cf545 to ebc2123 on June 11, 2024 at 10:01
@rhatdan (Member) commented Jun 11, 2024

I think the pull side is much more important than the push side. You usually push once and pull many times. Stick to decompression only.

@giuseppe giuseppe force-pushed the use-decompressor-filters-if-available branch from ebc2123 to 1fbb9e2 on June 11, 2024 at 10:31
@giuseppe giuseppe marked this pull request as ready for review on June 11, 2024 at 10:58
@giuseppe (Member Author)

ready for review, CI is green

@giuseppe giuseppe force-pushed the use-decompressor-filters-if-available branch from 1fbb9e2 to 6a8d126 on June 11, 2024 at 11:09
@rhatdan (Member) commented Jun 11, 2024

LGTM

@mtrmac (Collaborator) commented Jun 11, 2024

> One side effect of using pigz is definitely that for consumers this will result in unreproducible gzip output compared to the previous Go gzip.

FWIW we never promised that the compressed blobs would be reproducible (and we can’t; compression implementations make no such promises — and e.g. https://github.com/klauspost/compress changes too often for me to think that the output never changes).

And yes, from time to time we see users who say “it’s been working for us fine for a year”, and it’s a struggle to convince them that it might very well break the next week and that they need to change their design.

But it does break from time to time, and yes, they do need to change their design; preventing any single output change would not remove that need.
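
As a standalone illustration (not code from this PR): the same content compressed by different encoders, or even by the same encoder at different settings, generally yields different compressed bytes, while the decompressed content that a DiffID covers stays identical.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"fmt"
	"io"
)

// gzipWith compresses data at the given level; errors are ignored for brevity.
func gzipWith(level int, data []byte) []byte {
	var buf bytes.Buffer
	zw, _ := gzip.NewWriterLevel(&buf, level)
	zw.Write(data)
	zw.Close()
	return buf.Bytes()
}

func main() {
	data := bytes.Repeat([]byte("layer content "), 10000)
	fast := gzipWith(gzip.BestSpeed, data)
	small := gzipWith(gzip.BestCompression, data)
	fmt.Println("compressed blobs identical:", bytes.Equal(fast, small)) // false

	zf, _ := gzip.NewReader(bytes.NewReader(fast))
	zs, _ := gzip.NewReader(bytes.NewReader(small))
	df, _ := io.ReadAll(zf)
	ds, _ := io.ReadAll(zs)
	fmt.Println("decompressed digests identical:",
		sha256.Sum256(df) == sha256.Sum256(ds)) // true
}
```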

@mtrmac (Collaborator) left a comment

ACK to implementation.

pkg/archive/filter_test.go (outdated, resolved)
pkg/archive/filter.go (resolved)
@giuseppe giuseppe force-pushed the use-decompressor-filters-if-available branch from 6a8d126 to c283dd9 on June 11, 2024 at 12:53
@giuseppe (Member Author)

I've added a new patch to the PR; I realized the existing implementation was leaking the buffer on errors.
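
The shape of such a leak, as a hypothetical sketch using a plain sync.Pool rather than the actual pools helper in c/storage: an early error return has to hand the pooled buffer back, otherwise it is never reused.

```go
package archive

import (
	"bufio"
	"compress/gzip"
	"io"
	"sync"
)

var readerPool = sync.Pool{
	New: func() any { return bufio.NewReaderSize(nil, 32*1024) },
}

func decompress(src io.Reader) (io.ReadCloser, error) {
	buf := readerPool.Get().(*bufio.Reader)
	buf.Reset(src)

	gz, err := gzip.NewReader(buf)
	if err != nil {
		readerPool.Put(buf) // without this, the buffer is lost on the error path
		return nil, err
	}
	// On success the buffer is returned when the caller closes the stream,
	// e.g. via a ReadCloser wrapper (omitted here for brevity).
	return gz, nil
}
```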

pkg/archive/archive.go (outdated, resolved)
Signed-off-by: Giuseppe Scrivano <[email protected]>
use the pigz command line tool when available, as it is much faster at decompressing a gzip stream.

On my machine I've seen a 50% pull time reduction when pulling some
big images.

Signed-off-by: Giuseppe Scrivano <[email protected]>
the performance improvement is not as clear as with pigz, but it is still a measurable difference.

Signed-off-by: Giuseppe Scrivano <[email protected]>
@giuseppe giuseppe force-pushed the use-decompressor-filters-if-available branch from c283dd9 to ae8836f on June 11, 2024 at 14:18
@mtrmac (Collaborator) left a comment

/lgtm

(Purely personal aesthetic preference: differentiate the err and Err variables more. Also, named return values are a bit dangerous, so emphasizing that it is a return value might help. But that’s fully up to the maintainers of c/storage.)

@openshift-ci openshift-ci bot added the lgtm label Jun 11, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 9fc521d into containers:main Jun 11, 2024
18 checks passed
@vrothberg (Member)

@edsantiago, did you see performance improvements after vendoring this change into Podman CI?

Nice work, @giuseppe!

@edsantiago (Member)

@vrothberg I'm not ignoring your question. (1) this is not yet vendored into podman, and (2) our PR merge rate is so low these days that it's hard to get performance numbers. I am keeping this on my TODO list and will report back once there's something to learn.
