Add support for zstd compression #14706

Open · grossag wants to merge 15 commits into develop2 from topic/grossag/zstd3
Conversation

@grossag grossag commented Sep 8, 2023

Changelog: (Feature): Add support for zstd compression
Docs: Will create one if this PR is acceptable

  • Refer to the issue that supports this Pull Request.
  • If the issue has missing info, explain the purpose/use case/pain/need that covers this Pull Request.
  • I've read the Contributing guide.
  • I've followed the PEP8 style guides for Python code.
  • I've opened another PR in the Conan docs repo to the develop branch, documenting this one.

As discussed in issue #648, this change adds zstd support to conan in the following ways:

  1. The person or build running conan upload can set a config value core.upload:compression_format = zstd to upload binaries using zstd instead of gzip.
  2. The zstd compression is done entirely in Python using a combination of tarfile and python-zstandard. Then the file is uploaded as normal.
  3. When downloading packages, if a .tar.zst file is encountered, the extraction code uses tarfile and python-zstandard to extract.
  4. Adds a test to cover zstd compression and decompression.

I chose python-zstandard as the library because that is what urllib3 uses. The package has not yet hit 1.0 but urllib3 is a mature package and it says a lot to me that they chose python-zstandard.
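To make the approach concrete, here is a minimal sketch of the in-process tarfile + python-zstandard compression path; the function name and paths are illustrative, not the actual Conan code:

```python
import tarfile

import zstandard  # python-zstandard


def compress_folder_zstd(src_dir, out_path, level=3):
    """Tar a folder and zstd-compress the stream, entirely in-process."""
    cctx = zstandard.ZstdCompressor(level=level)
    with open(out_path, "wb") as fh:
        # Everything tarfile writes is compressed on the fly.
        with cctx.stream_writer(fh) as writer:
            with tarfile.open(fileobj=writer, mode="w|") as tar:
                tar.add(src_dir, arcname=".")


compress_folder_zstd("package_folder", "conan_package.tzst")
```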

I apologize in advance if I'm missing important parts of the developer workflow. If this approach is acceptable, I'll create a docs PR as requested.

Developer docs on all branches say to open pull requests against develop but AFAICT that is Conan 1.x. I'm opening this against develop2 instead because that appears to be Conan 2.x; I hope that's the right thing to do.

Because zstd decompression is expected to just work if the server has a .tar.zst file, I am including zstandard in requirements.txt. https://python-zstandard.readthedocs.io/en/latest/projectinfo.html#state-of-project recommends that we "Pin the package version to prevent unwanted breakage when this change occurs!", although I doubt that much will change before an eventual 1.0.
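For example, a pin satisfying that recommendation could look like this in requirements.txt (the bounds below are illustrative, reflecting the 0.20/0.21 discussion later in this thread rather than the final diff):

```
zstandard>=0.20,<0.22
```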
@CLAassistant CLAassistant commented Sep 8, 2023

CLA assistant check
All committers have signed the CLA.

CI is unable to find 0.21.0
@grossag grossag (Author) commented Sep 11, 2023

I am working through my company's CLA approval process and hope to sign it by end of day today. In the meantime, I wrote a script to test compression and decompression of a test folder using various gzip and zstd compression levels and ran it overnight on a 7.1GB folder with 16000 files. I put the script here in case you all find it useful: https://gist.github.com/grossag/525f3cdaf7d985b625a38df55a7c9087
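For context, the measurement loop in such a script can be as simple as the hedged sketch below; `compress_folder_zstd` is the illustrative helper from the PR description above, and the decompression helper is likewise assumed, not the gist's actual code:

```python
import os
import statistics
import time


def bench_zstd_levels(src_dir, levels=(3, 4, 5), runs=5):
    for level in levels:
        out = f"bench-{level}.tar.zst"
        start = time.perf_counter()
        compress_folder_zstd(src_dir, out, level=level)  # see sketch above
        size_gb = os.path.getsize(out) / 1e9
        print(f"zstd {level}: {time.perf_counter() - start:.2f}s, {size_gb:.3f} GB")

        times = []
        for _ in range(runs):
            start = time.perf_counter()
            extract_zstd(out, "bench-out")  # assumed decompression helper
            times.append(time.perf_counter() - start)
        print(f"  decompress: {statistics.mean(times):.2f} mean, "
              f"{statistics.median(times):.2f} median, {statistics.stdev(times):.2f} stdev")
```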

Run on a VM using shared NAS storage:

gzip level 7:
	- Compression time: 554.04 seconds
	- Compression size: 1.988 GB
	- Decompression times in seconds: 89.50 mean, 88.05 median, 2.04 stdev
gzip level 8:
	- Compression time: 1079.89 seconds
	- Compression size: 1.978 GB
	- Decompression times in seconds: 87.20 mean, 87.05 median, 1.84 stdev
gzip level 9 (default compression level):
	- Compression time: 2080.31 seconds
	- Compression size: 1.976 GB
	- Decompression times in seconds: 85.34 mean, 86.40 median, 3.10 stdev

zstd level 3 (default compression level):
	- Compression time: 136.69 seconds
	- Compression size: 1.498 GB
	- Decompression times in seconds: 54.33 mean, 54.14 median, 4.04 stdev
zstd level 4:
	- Compression time: 125.80 seconds
	- Compression size: 1.478 GB
	- Decompression times in seconds: 52.21 mean, 52.98 median, 2.19 stdev
zstd level 5:
	- Compression time: 115.56 seconds
	- Compression size: 1.398 GB
	- Decompression times in seconds: 50.87 mean, 51.97 median, 4.49 stdev

My work laptop has widely varying performance right now, where zstd decompression of the same files jumps between 47 and 60 seconds. So here are the first results which I need to rerun:

gzip level 9 (default compression level):
	- Compression time: 1144.69 seconds
	- Compression size: 1.975 GB
	- Decompression times in seconds: 94.60 mean, 97.29 median, 8.86 stdev
zstd level 5:
	- Compression time: 69.69 seconds
	- Compression size: 1.394 GB
	- Decompression times in seconds: 55.63 mean, 59.06 median, 8.00 stdev

zstd is interesting because my results show that decompression time doesn't change as you increase the compression level, maybe with the exception of the really high levels 20-22. Overall, my results can be summarized as: on both machines, zstd level 5 shows a size reduction of 30% and a decompression time reduction of 35-40% compared to gzip level 9.

@grossag grossag (Author) commented Sep 11, 2023

Looks like my virus scanner was causing the high variance in hash performance testing. Here are some results, comparing zstd level 9 with gzip level 9 on my laptop:

Boost (1.1GB and 15000 files):

gzip level 9 (default compression level):
	- Compression time: 178.97 seconds
	- Compression size: 192.741 MB
	- Decompression times in seconds: 9.89 mean, 9.95 median, 0.09 stdev
zstd level 9:
	- Compression time: 12.34 seconds
	- Compression size: 130.144 MB
	- Decompression times in seconds: 6.99 mean, 7.01 median, 0.10 stdev

Compiler toolset (7.1 GB and 16000 files):

gzip level 9 (default compression level):
	- Compression time: 1473.80 seconds
	- Compression size: 1.975 GB
	- Decompression times in seconds: 34.64 mean, 34.54 median, 0.23 stdev
zstd level 9:
	- Compression time: 91.72 seconds
	- Compression size: 1.258 GB
	- Decompression times in seconds: 17.33 mean, 16.95 median, 1.01 stdev

So my tests are still showing 20-50% improvements in decompression time.

1. Change requirements.txt to allow either zstandard 0.20 or 0.21. That
   prevents a downgrade for people who already have 0.21 installed, while
   also allowing CI to find 0.20.
2. Move the compressformat parameter earlier in the compress_files() function.
   It makes a bit more sense earlier; as long as consumers are correctly
   passing it as a keyword argument, it shouldn't break anyone.
@13steinj
one way or the other I'll have to implement this + more for my org eventually--

can this be changed to be done in an expandable manner? Something like:

core.packager.binaries.compressor.native = false # one of true, false; true uses a command
core.packager.binaries.compressor = gzip  # one of pigz, gzip, bzip2, xz, lzip, lzma, lzop, gzip, zstd; if native, also arbitrary
core.packager.binaries.compressor.suffix = auto  # one of auto, or if compressor.native AND unknown compressor, custom defaulting to first word of compressor (program)
core.packager.binaries.archiver = python, native # one of python, tar. Defaults to decompressor, native defaults to tar. 
core.packager.binaries.decompressor = python # one of python, tar (auto detect default), or native (based off of suffix)
core.packager.binaries.dearchiver = python  # one of python, native (tar). Defaults to decompressor (tar and native defaults to tar). 

maybe some other variations... bit of a hard problem to make this workable for everyone.

@grossag grossag (Author) commented Sep 27, 2023

> one way or the other I'll have to implement this + more for my org eventually--
>
> can this be changed to be done in an expandable manner? Something like:
>
> core.packager.binaries.compressor.native = false # one of true, false; true uses a command
> core.packager.binaries.compressor = gzip  # one of pigz, gzip, bzip2, xz, lzip, lzma, lzop, gzip, zstd; if native, also arbitrary
> core.packager.binaries.compressor.suffix = auto  # one of auto, or if compressor.native AND unknown compressor, custom defaulting to first word of compressor (program)
> core.packager.binaries.archiver = python, native # one of python, tar. Defaults to decompressor, native defaults to tar.
> core.packager.binaries.decompressor = python # one of python, tar (auto detect default), or native (based off of suffix)
> core.packager.binaries.dearchiver = python  # one of python, native (tar). Defaults to decompressor (tar and native defaults to tar).
>
> maybe some other variations... bit of a hard problem to make this workable for everyone.

Hey, thanks for the review! What is your use case that requires this additional customization? I found that using a separate tar process made things difficult to manage and was actually a tiny bit slower than in-proc python-zstandard.

@13steinj
> Hey, thanks for the review! What is your use case that requires this additional customization? I found that using a separate tar process made things difficult to manage and was actually a tiny bit slower than in-proc python-zstandard.

With respect to native vs non-native (subprocess or Python-level), my experience has unfortunately been that the parallel downloads feature is "not actually parallel" because tar extraction is not parallel (partially due to the GIL, partially due to how the code is structured). This was tested on Conan 1.5X. On large binary packages with no other core/job restrictions, parallel_downloads was fastest set to 2 rather than 16, with a large amount of time wasted in tarfile 😢.

While at a previous org, monkeypatch-experimenting (because everything is Python, yay!) to replace it with a native call to tar + pigz (so-called "fake" parallelism for decompression) was faster, and I expect pugz to be even faster.

This isn't to say I don't want this feature you've written, I do! But with Conan's commitment to backwards compatibility in 2.0, I would expect that config options need a lot of granularity in order to suffice for future use cases (for example, some binary package data that I've played with over the past year suggests that bz2 is optimal instead).

I'm less asking for additional customization right now and more for the config to be structured so that additional customization can be added later. E.g. core.upload may be a poor choice, and there is already, unfortunately, core.gzip:compresslevel, which I assume would be better off under some sub-namespace but now has to work for the foreseeable future.
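For reference, the tar + pigz monkeypatch being described boils down to something like this (a sketch assuming GNU tar and pigz are on PATH; not code from this PR):

```python
import subprocess


def extract_tgz_native(archive_path, dest_dir):
    """Delegate extraction to tar, with pigz doing the parallel gunzip."""
    subprocess.run(
        ["tar", "--use-compress-program=pigz", "-xf", archive_path, "-C", dest_dir],
        check=True,
    )
```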

@grossag grossag (Author) commented Oct 10, 2023

> Hey, thanks for the review! What is your use case that requires this additional customization? I found that using a separate tar process made things difficult to manage and was actually a tiny bit slower than in-proc python-zstandard.
>
> With respect to native vs non-native (subprocess or Python-level), my experience has unfortunately been that the parallel downloads feature is "not actually parallel" because tar extraction is not parallel (partially due to the GIL, partially due to how the code is structured). This was tested on Conan 1.5X. On large binary packages with no other core/job restrictions, parallel_downloads was fastest set to 2 rather than 16, with a large amount of time wasted in tarfile 😢.
>
> While at a previous org, monkeypatch-experimenting (because everything is Python, yay!) to replace it with a native call to tar + pigz (so-called "fake" parallelism for decompression) was faster, and I expect pugz to be even faster.
>
> This isn't to say I don't want this feature you've written, I do! But with Conan's commitment to backwards compatibility in 2.0, I would expect that config options need a lot of granularity in order to suffice for future use cases (for example, some binary package data that I've played with over the past year suggests that bz2 is optimal instead).
>
> I'm less asking for additional customization right now and more for the config to be structured so that additional customization can be added later. E.g. core.upload may be a poor choice, and there is already, unfortunately, core.gzip:compresslevel, which I assume would be better off under some sub-namespace but now has to work for the foreseeable future.

This is something I would want direction from the maintainers on if I were to do it. In the discussion with @memsharded in #648, the idea of deferring compression and decompression to native tools was not ideal because of testing and compatibility concerns. That is why I tried to do this in all-Python code. I am very happy with the Python zstd decompression performance so far across Windows, Mac, and Linux. The python-zstandard library releases the GIL before calling into the zstd C library, so you aren't losing performance there.

This change represents the most minimal one I could do while still accomplishing my goals. On the GH issue I referenced, accepting this PR was not guaranteed so I wanted to limit the complexity to show that it is supportable long-term.
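To illustrate the GIL point: since the zstd C library runs with the GIL released, plain threads are enough to overlap decompression of separate archives. A hedged sketch (the `extract_zstd` helper is hypothetical, an in-process tarfile + python-zstandard extractor):

```python
from concurrent.futures import ThreadPoolExecutor


def extract_many(archives, dest_dirs, workers=4):
    # Decompression of separate archives overlaps across threads
    # because python-zstandard releases the GIL inside the C library.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(extract_zstd, archives, dest_dirs))
```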

@13steinj
Fair enough; I'm mainly concerned about other package types and how this interacts with compatibility concerns.

To be clear I'm not suggesting you implement all of these methods right now, just for the config key chosen to be extensible for the future.

@exjam exjam force-pushed the topic/grossag/zstd3 branch from 7d586bc to a33394d on October 17, 2023 22:56
@Ext3h Ext3h commented Oct 31, 2023

> • Decompression times in seconds: 17.33 mean, 16.95 median, 1.01 stdev

This still ain't looking right. This is underperforming by a factor of 2-3x compared to what it should be.

There is likely some major overhead, as tarfile serializes an embarrassingly parallel problem by hopping between blocking file system accesses and decompression in a single thread...

grossag added 6 commits July 22, 2024 10:46
1. Fix bad merge causing uploader.py change to still refer to `self._app.cache.new_config`, when now we are supposed to use `self._global_conf`.
2. Change two output calls in uploader.py to only output the package file basename to be consistent with other existing log lines.
3. Use double quotes instead of single quotes to be more consistent with existing code.
1. Downgrade bufsize to 32KB because that performs well for compression and
   decompression. The values don't need to be the same, but it happened to be
   the best value in both compression and decompression tests.
2. Use a context manager for stream_reader as I do for stream_writer.
3. Add some comments about the bufsize value.
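A sketch of what the decompression side with that buffer size might look like (an assumed shape, not the exact diff):

```python
import tarfile

import zstandard

ZSTD_BUFSIZE = 32 * 1024  # 32KB tested well for both compression and decompression


def extract_zstd(archive_path, dest_dir):
    """Stream-decompress a .tzst archive and unpack it with tarfile."""
    dctx = zstandard.ZstdDecompressor()
    with open(archive_path, "rb") as fh:
        with dctx.stream_reader(fh, read_size=ZSTD_BUFSIZE) as reader:
            with tarfile.open(fileobj=reader, mode="r|") as tar:
                tar.extractall(path=dest_dir)
```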
@grossag grossag (Author) commented Aug 26, 2024

@memsharded Are you able to review this PR?

@memsharded memsharded (Member) left a comment

Thanks for your contribution, and sorry that we haven't been able to find time to review this.

This PR as-is looks a bit risky, one of the main reasons being the addition of the new zstandard library dependency. It would likely be better added as a conditional requirement (protecting its import with a try-except and a clear message).

But I'd say that it is not impossible to move it forward; based on the diff, I think the risk of the code changes can be controlled. Please check the comments.

Thanks again for your contribution.

```python
if f not in zipped_files:
    raise ConanException(f"Corrupted {pref} in '{remote.name}' remote: no {f}")
accepted_package_files = [PACKAGE_TZSTD_NAME, PACKAGE_TGZ_NAME]
package_file = next((f for f in zipped_files if f in accepted_package_files), None)
```
@memsharded (Member):

Basically, a package could contain both compressed artifacts, but this will prioritize and only download the zstd one if it exists?

Wouldn't it be a bit less confusing to not allow both compression formats' artifacts in the same package?

@grossag (Author):

A package is only supposed to contain one. Let's say an organization switches to zstd compression on Jan 1 2025. The expectation would be that packages produced before then would have .tgz extension and packages produced after then would have .tzst extension. I would like to avoid producing both because it would result in unnecessary storage usage in Artifactory.

Comment on lines +84 to +89
```python
accepted_package_files = [PACKAGE_TZSTD_NAME, PACKAGE_TGZ_NAME]
accepted_files = ["conaninfo.txt", "conanmanifest.txt", "metadata/sign"]
for f in accepted_package_files:
    if f in server_files:
        accepted_files = [f] + accepted_files
        break
```
@memsharded (Member):

If we assume there can only be one compressed artifact, in one of the formats, could this be simplified?

@grossag (Author):

Sorry, I think I'm missing what you are saying here. I don't have the context about if/how these accepted files changed over time. But Artifactory would only have .tgz or .tzst, not both. If that means we can simplify this a bit, that's fine with me.

Still need to do some testing though.
Newer Python has this warning:

DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives
and reject files or modify their metadata. Use the filter argument to control
this behavior
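For reference, passing an explicit extraction filter addresses the warning (the filter argument exists on Python 3.12+ and in backported 3.8-3.11 patch releases); a minimal sketch with an illustrative file name:

```python
import tarfile

with tarfile.open("conan_package.tgz") as tar:
    # "data" rejects absolute paths and links escaping the destination,
    # and silences the Python 3.14 filter DeprecationWarning.
    tar.extractall(path="destination", filter="data")
```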
bentonj-omnissa added commits to omnissa-oss-forks/conan referencing this pull request (Oct 6, Oct 7, Nov 20, Nov 25, and Dec 3, 2024): "Squashed version of PR conan-io#14706 as of 03/10/2024."