Use pbzip2/pigz to decompress corpora if available #947

Merged (6 commits) on Apr 6, 2020

@dliappis dliappis commented Apr 2, 2020

Decompressing large corpora with the standard bzip2/gzip libraries
can be slow because they use only one CPU core. Use pbzip2/pigz,
if available, to speed up the process by utilizing all cores.
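
As a rough illustration of the idea (not the code merged in this PR), the following minimal sketch prefers an external `pbzip2`/`pigz` binary when one is on the `PATH` and otherwise falls back to Python's standard `bz2`/`gzip` modules. The function name `decompress` and its return value are assumptions made for this example only.

```python
import bz2
import gzip
import shutil
import subprocess


def decompress(archive_path, target_path):
    """Decompress a .bz2 or .gz archive into target_path.

    Hypothetical helper: returns the name of the tool that was used.
    """
    if archive_path.endswith(".bz2"):
        parallel_tool, fallback_open = "pbzip2", bz2.open
    elif archive_path.endswith(".gz"):
        parallel_tool, fallback_open = "pigz", gzip.open
    else:
        raise ValueError(f"unsupported archive type: {archive_path}")

    if shutil.which(parallel_tool):
        # -d: decompress, -k: keep the archive, -c: write to stdout
        with open(target_path, "wb") as out:
            subprocess.run(
                [parallel_tool, "-d", "-k", "-c", archive_path],
                stdout=out,
                check=True,
            )
        return parallel_tool

    # Fallback: single-threaded decompression with the standard library.
    with fallback_open(archive_path, "rb") as src, open(target_path, "wb") as out:
        shutil.copyfileobj(src, out)
    return "stdlib"
```

Writing to stdout with `-c` keeps the target path under the caller's control, and `check=True` turns a failed external decompression into an exception instead of a silently truncated file.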

@dliappis dliappis added enhancement Improves the status quo :Usability Makes Rally easier to use labels Apr 2, 2020
@dliappis dliappis added this to the 1.5.0 milestone Apr 2, 2020
@dliappis dliappis self-assigned this Apr 2, 2020
@danielmitterdorfer danielmitterdorfer left a comment

I did a first pass; it already looks good, but I have a few suggestions.

esrally/utils/io.py: 7 review threads (6 outdated, all resolved)

dliappis commented Apr 2, 2020

To illustrate the benefit of this approach, here are measurements from a machine with 12 Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz cores and a Micron_5100_MTFDDAK1T9TBY NVMe disk.

  1. Using the nyc_taxis corpus as currently compressed (standard bzip2), decompression with current Rally master takes ~29m. Only one core is at 100% utilization throughout the process.

  2. Using a corpus recompressed with `pbzip2 -v -k -m10000 documents.json` and extracted with this PR, decompression takes ~7m 30s, i.e. a ~75% reduction in execution time. All cores are at 100% utilization throughout extraction.
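
For reference, the reduction quoted above can be checked with a few lines of arithmetic. This is only a sketch over the two approximate timings reported in this comment (~29m and ~7m 30s); the helper name `reduction_percent` is made up for the example.

```python
def reduction_percent(before_seconds, after_seconds):
    """Relative reduction in execution time, as a percentage."""
    return 100.0 * (before_seconds - after_seconds) / before_seconds


before = 29 * 60      # ~29m with standard bzip2 on one core
after = 7 * 60 + 30   # ~7m 30s with pbzip2 on all cores

print(round(reduction_percent(before, after), 1))  # 74.1
print(round(before / after, 2))                    # 3.87 (speedup factor)
```

The result of about 74% agrees with the "~75%" figure, which corresponds to a speedup of slightly under 4x.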

@danielmitterdorfer danielmitterdorfer left a comment

I've now tested it locally and have a few more suggestions about the error output.

esrally/utils/io.py: 2 review threads (both outdated, both resolved)
dliappis added a commit to dliappis/rally-tracks that referenced this pull request Apr 3, 2020
Update compressed-bytes for all corpora after re-compressing them using
`pbzip2 -9 -v -k -m10000`. Together with elastic/rally#947
this allows for much faster decompression utilizing all available CPU cores.
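
After recompressing a corpus, the `compressed-bytes` value recorded for it no longer matches the archive on disk. The sketch below shows one way such a value could be refreshed, assuming `compressed-bytes` is simply the on-disk size of the archive named by `source-file`; the function `refresh_compressed_bytes` and the entry layout are illustrative assumptions, not the script used for the referenced commit.

```python
import os


def refresh_compressed_bytes(document_entry, corpus_dir):
    """Return a copy of a corpus document entry with an updated size.

    Hypothetical helper: assumes the entry has a "source-file" key naming
    the compressed archive located in corpus_dir.
    """
    archive = os.path.join(corpus_dir, document_entry["source-file"])
    updated = dict(document_entry)
    # Size in bytes of the recompressed archive as stored on disk.
    updated["compressed-bytes"] = os.path.getsize(archive)
    return updated
```

Returning a copy rather than mutating the entry in place makes it easy to compare the old and new values before writing anything back.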
@danielmitterdorfer danielmitterdorfer left a comment

Looks great; thank you! LGTM

dliappis added a commit to elastic/rally-tracks that referenced this pull request Apr 6, 2020
Update compressed-bytes for all corpora after re-compressing them using
`pbzip2 -9 -v -k -m10000`. Together with elastic/rally#947
this allows for much faster decompression utilizing all available CPU cores.
@dliappis dliappis merged commit d7c3575 into elastic:master Apr 6, 2020
dliappis added a commit to dliappis/rally that referenced this pull request Apr 7, 2020
dliappis added a commit that referenced this pull request Apr 7, 2020
dliappis added a commit to dliappis/rally-tracks that referenced this pull request Apr 14, 2021
…tic#109)

Update compressed-bytes for all corpora after re-compressing them using
`pbzip2 -9 -v -k -m10000`. Together with elastic/rally#947
this allows for much faster decompression utilizing all available CPU cores.
dliappis added a commit to elastic/rally-tracks that referenced this pull request Apr 15, 2021
Update compressed-bytes for all corpora after re-compressing them using
`pbzip2 -9 -v -k -m10000`. Together with elastic/rally#947
this allows for much faster decompression utilizing all available CPU cores.

Backport of #109
Relates #1240
Labels
enhancement Improves the status quo :Usability Makes Rally easier to use

2 participants