Use pbzip2/pigz to decompress corpora if available #947
Decompressing large corpora with the standard bzip2/gzip libraries can be slow because they use only one CPU core. If pbzip2/pigz are available, use them instead to speed up decompression by spreading the work across all cores.
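For illustration, the fallback logic described above could look roughly like the sketch below. This is not the code from this PR: the function names `pick_decompressor` and `decompress` and the injectable `which` parameter are hypothetical, and the only external flags assumed are `-d` (decompress) and `-c` (write to stdout), which both pbzip2 and pigz accept.

```python
import bz2
import gzip
import os
import shutil
import subprocess


def pick_decompressor(extension, which=shutil.which):
    """Return an external command for this extension, or None if the
    parallel tool is not on the PATH (hypothetical helper)."""
    candidates = {
        ".bz2": ["pbzip2", "-d", "-c"],
        ".gz": ["pigz", "-d", "-c"],
    }
    cmd = candidates.get(extension)
    if cmd is not None and which(cmd[0]) is not None:
        return cmd
    return None


def decompress(src, dest, which=shutil.which):
    """Decompress src into dest and return the name of the tool used."""
    extension = os.path.splitext(src)[1]
    cmd = pick_decompressor(extension, which)
    if cmd is not None:
        # Parallel path: stream the tool's stdout into the target file.
        with open(dest, "wb") as out:
            subprocess.run(cmd + [src], stdout=out, check=True)
        return cmd[0]
    # Fallback path: single-core decompression with the standard library.
    opener = {".bz2": bz2.open, ".gz": gzip.open}[extension]
    with opener(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return "stdlib"
```

Passing `which` as a parameter is only a convenience for exercising the fallback path on a machine where the parallel tools happen to be installed.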
I did a first pass; it already looks good, but I have a few suggestions.
I am also sharing the benefits from this approach, on a machine with 12
I've now tested it locally and have a few more suggestions about the error output.
Update compressed-bytes for all corpora after re-compressing them using `pbzip2 -9 -v -k -m10000`. Together with elastic/rally#947 this allows for much faster decompression utilizing all available CPU cores.
Looks great; thank you! LGTM
…tic#109) Update compressed-bytes for all corpora after re-compressing them using `pbzip2 -9 -v -k -m10000`. Together with elastic/rally#947 this allows for much faster decompression utilizing all available CPU cores.
Update compressed-bytes for all corpora after re-compressing them using `pbzip2 -9 -v -k -m10000`. Together with elastic/rally#947 this allows for much faster decompression utilizing all available CPU cores. Backport of #109 Relates #1240
Decompressing large corpora using the standard bzip2/gzip libraries
can be a slow process as they only utilize one cpu core. Take
advantage of pbzip2/pigz, if available, to speed up the process by
taking advantage of all cores.