Add support for zstd-compressed corpora #1781

Closed
danielmitterdorfer opened this issue Sep 21, 2023 · 0 comments · Fixed by #1786
Labels
enhancement Improves the status quo :Track Management New operations, changes in the track format, track download changes and the like

Rally supports various compression formats such as gz or bzip. It does not support the zstd format, which performed significantly better in both disk usage and decompression speed in my experiments. I compressed a 183 GB corpus with pbzip2 and pzstd, each at the maximum compression level supported by the respective tool.

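| Format | Size on disk [bytes] | Size on disk [GB] | Relative size [%] |
|--------|----------------------|-------------------|-------------------|
| bzip   | 18613471805          | 18                | 100               |
| zstd   | 11215205385          | 11                | 60                |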
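
The derived columns of the table above can be reproduced from the raw byte counts. This is a minimal sketch, assuming the GB column uses decimal gigabytes (10^9 bytes) truncated to a whole number, which is what matches the figures shown; the `summarize` helper is hypothetical and only for illustration.

```python
# Byte counts taken from the table above.
SIZES = {"bzip": 18613471805, "zstd": 11215205385}


def summarize(sizes, baseline="bzip"):
    """Return {format: (whole decimal GB, size relative to baseline in %)}."""
    base = sizes[baseline]
    return {
        # Assumption: GB = 10**9 bytes, truncated; percentage rounded.
        name: (size // 10**9, round(100 * size / base))
        for name, size in sizes.items()
    }


for name, (gb, rel) in summarize(SIZES).items():
    print(f"{name}: {gb} GB, {rel} %")
```

A relative size of 60 % corresponds to a file that is roughly 40 % smaller than the bzip baseline.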

Decompression is also much faster (measured with time; the table shows the real value, i.e. wall-clock time):

| Format | Time to decompress [s] | Relative time [%] |
|--------|------------------------|-------------------|
| bzip   | 388                    | 100               |
| zstd   | 144                    | 36                |

Therefore, I propose adding zstd support to Rally, similar to the existing bzip support: the fast path would require pzstd to be on PATH, and a fallback can be based on the Python zstd implementation.

For reference:

  • Compress data: pzstd -19 corpus.json -o corpus.json.zstd (19 is the highest compression level available without the --ultra flag)
  • Decompress data: pzstd -d corpus.json.zstd -o corpus.json
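
The proposed "fast path with fallback" could look roughly like the sketch below. It is not Rally's actual implementation: the function names are hypothetical, and the fallback assumes the third-party `zstandard` package is installed as the Python zstd implementation.

```python
import shutil
import subprocess


def pzstd_command(src, dest):
    """Build the decompression call from the issue: pzstd -d <src> -o <dest>."""
    return ["pzstd", "-d", src, "-o", dest]


def decompress_zstd(src, dest):
    """Decompress src into dest and return the name of the backend used."""
    if shutil.which("pzstd"):
        # Fast path: the parallel pzstd binary is available on PATH.
        subprocess.run(pzstd_command(src, dest), check=True)
        return "pzstd"
    # Fallback (assumption): the third-party `zstandard` package.
    import zstandard

    with open(src, "rb") as fin, open(dest, "wb") as fout:
        # Stream the data so the corpus never has to fit in memory.
        zstandard.ZstdDecompressor().copy_stream(fin, fout)
    return "zstandard"
```

Calling `decompress_zstd("corpus.json.zstd", "corpus.json")` would use pzstd when it is installed and otherwise stream through the Python library, at the cost of single-threaded decompression.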
danielmitterdorfer added the enhancement and :Track Management labels Sep 21, 2023
danielmitterdorfer added a commit to danielmitterdorfer/rally that referenced this issue Sep 27, 2023
With this commit we add support for zstd compressed corpora. Compared to
bzip, the zstd format produces compressed files that are roughly 40%
smaller and took around a third of the time to decompress in our tests.

Closes elastic#1781