Update corpora compressed size after re-compression with pbzip2 #109

Merged (1 commit) on Apr 6, 2020

@dliappis commented Apr 3, 2020

Update compressed-bytes for all corpora after re-compressing them using
`pbzip2 -9 -v -k -m10000`. Together with elastic/rally#947
this allows for much faster decompression utilizing all available CPU cores.
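
For context on why re-compression helps: pbzip2 writes its output as a series of independent bzip2 streams placed back to back, so a decompressor can hand each stream to a different core. The sketch below illustrates that property with Python's standard-library `bz2` module only. The chunk size and the helper names `compress_in_streams` and `decompress_in_parallel` are assumptions made for illustration; this is not pbzip2's actual chunking logic and not the code from elastic/rally#947.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor
from typing import List


def compress_in_streams(data: bytes, chunk_size: int) -> List[bytes]:
    """Compress each chunk as its own complete bzip2 stream (hypothetical helper)."""
    return [
        bz2.compress(data[i:i + chunk_size], compresslevel=9)
        for i in range(0, len(data), chunk_size)
    ]


def decompress_in_parallel(streams: List[bytes], workers: int = 4) -> bytes:
    """Decompress independent streams concurrently and join them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the chunks are reassembled correctly.
        return b"".join(pool.map(bz2.decompress, streams))


if __name__ == "__main__":
    payload = b'{"message": "hello"}\n' * 50_000
    streams = compress_in_streams(payload, chunk_size=100_000)

    # Concatenated streams still form a file that a regular bzip2 reader accepts.
    archive = b"".join(streams)
    assert bz2.decompress(archive) == payload

    # Because the streams are independent, they can also be decompressed separately.
    assert decompress_in_parallel(streams) == payload
    print(len(streams), "independent streams")
```

A file produced by plain `bzip2` is a single stream, which is why the existing corpora had to be re-compressed before parallel decompression could pay off.
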
@dliappis self-assigned this Apr 3, 2020

dliappis commented Apr 3, 2020

Note to reviewers:

  1. To help with the review, this is the output of `ls -laR` from a directory containing both the uncompressed corpora and their pbzip2-compressed counterparts:
esbench@elasticsearch-0:~/.rally/benchmarks/races$ ls -laR pbzip2-tracks/
pbzip2-tracks/:
total 60
drwxrwxr-x 15 esbench esbench 4096 Apr  2 17:09 .
drwxr-xr-x  5 esbench esbench 4096 Apr  2 19:34 ..
drwxrwxr-x  2 esbench esbench 4096 Apr  2 19:43 eventdata
drwxrwxr-x  2 esbench esbench 4096 Apr  2 19:41 geonames
drwxrwxr-x  2 esbench esbench 4096 Apr  2 21:47 geopoint
drwxrwxr-x  2 esbench esbench 4096 Apr  2 20:01 geopointshape
drwxrwxr-x  2 esbench esbench 4096 Apr  2 20:21 geoshape
drwxrwxr-x  2 esbench esbench 4096 Apr  2 21:01 http_logs
drwxrwxr-x  2 esbench esbench 4096 Apr  2 21:01 metricbeat
drwxrwxr-x  2 esbench esbench 4096 Apr  2 20:27 nested
drwxrwxr-x  2 esbench esbench 4096 Apr  2 20:22 noaa
drwxrwxr-x  2 esbench esbench 4096 Apr  2 21:02 nyc_taxis
drwxrwxr-x  2 esbench esbench 4096 Apr  2 19:51 percolator
drwxrwxr-x  2 esbench esbench 4096 Apr  2 20:03 pmc
drwxrwxr-x  2 esbench esbench 4096 Apr  2 19:51 so

pbzip2-tracks/eventdata:
total 16826068
drwxrwxr-x  2 esbench esbench        4096 Apr  2 19:43 .
drwxrwxr-x 15 esbench esbench        4096 Apr  2 17:09 ..
-rw-rw-r--  1 esbench esbench 16437108429 Apr  2 17:07 eventdata.json
-rw-rw-r--  1 esbench esbench   792768300 Apr  2 17:07 eventdata.json.bz2

pbzip2-tracks/geonames:
total 3723476
drwxrwxr-x  2 esbench esbench       4096 Apr  2 19:41 .
drwxrwxr-x 15 esbench esbench       4096 Apr  2 17:09 ..
-rw-rw-r--  1 esbench esbench 3547613828 Apr  2 17:08 documents-2.json
-rw-rw-r--  1 esbench esbench  265208777 Apr  2 17:08 documents-2.json.bz2

pbzip2-tracks/geopoint:
total 2884888
drwxrwxr-x  2 esbench esbench       4096 Apr  2 21:47 .
drwxrwxr-x 15 esbench esbench       4096 Apr  2 17:09 ..
-rw-rw-r--  1 esbench esbench 2448564579 Apr  2 17:08 documents.json
-rw-rw-r--  1 esbench esbench  505542241 Apr  2 17:08 documents.json.bz2

pbzip2-tracks/geopointshape:
total 3197516
drwxrwxr-x  2 esbench esbench       4096 Apr  2 20:01 .
drwxrwxr-x 15 esbench esbench       4096 Apr  2 17:09 ..
-rw-rw-r--  1 esbench esbench 2780550484 Apr  2 17:08 documents.json
-rw-rw-r--  1 esbench esbench  493689712 Apr  2 17:08 documents.json.bz2

pbzip2-tracks/geoshape:
total 61637736
drwxrwxr-x  2 esbench esbench        4096 Apr  2 20:21 .
drwxrwxr-x 15 esbench esbench        4096 Apr  2 17:09 ..
-rw-rw-r--  1 esbench esbench 12592499821 Apr  2 17:09 linestrings.json
-rw-rw-r--  1 esbench esbench  3698508764 Apr  2 17:09 linestrings.json.bz2
-rw-rw-r--  1 esbench esbench  5992834062 Apr  2 17:09 multilinestrings.json
-rw-rw-r--  1 esbench esbench  1817213095 Apr  2 17:09 multilinestrings.json.bz2
-rw-rw-r--  1 esbench esbench 30178820325 Apr  2 17:08 polygons.json
-rw-rw-r--  1 esbench esbench  8837117359 Apr  2 17:08 polygons.json.bz2

pbzip2-tracks/http_logs:
total 62259752
drwxrwxr-x  2 esbench esbench        4096 Apr  2 21:01 .
drwxrwxr-x 15 esbench esbench        4096 Apr  2 17:09 ..
-rw-rw-r--  1 esbench esbench   363512754 Apr  2 17:07 documents-181998.json
-rw-rw-r--  1 esbench esbench    13843641 Apr  2 17:07 documents-181998.json.bz2
-rw-rw-r--  1 esbench esbench   303920342 Apr  2 17:06 documents-181998.unparsed.json
-rw-rw-r--  1 esbench esbench    13088137 Apr  2 17:06 documents-181998.unparsed.json.bz2
-rw-rw-r--  1 esbench esbench  1301732149 Apr  2 17:06 documents-191998.json
-rw-rw-r--  1 esbench esbench    49546887 Apr  2 17:06 documents-191998.json.bz2
-rw-rw-r--  1 esbench esbench  1088378738 Apr  2 17:07 documents-191998.unparsed.json
-rw-rw-r--  1 esbench esbench    47290776 Apr  2 17:07 documents-191998.unparsed.json.bz2
-rw-rw-r--  1 esbench esbench  1744012279 Apr  2 17:06 documents-201998.json
-rw-rw-r--  1 esbench esbench    65759419 Apr  2 17:06 documents-201998.json.bz2
-rw-rw-r--  1 esbench esbench  1456836090 Apr  2 17:07 documents-201998.unparsed.json
-rw-rw-r--  1 esbench esbench    63278452 Apr  2 17:07 documents-201998.unparsed.json.bz2
-rw-rw-r--  1 esbench esbench  2364230815 Apr  2 17:07 documents-211998.json
-rw-rw-r--  1 esbench esbench    88445049 Apr  2 17:07 documents-211998.json.bz2
-rw-rw-r--  1 esbench esbench  1975990671 Apr  2 17:07 documents-211998.unparsed.json
-rw-rw-r--  1 esbench esbench    85739523 Apr  2 17:07 documents-211998.unparsed.json.bz2
-rw-rw-r--  1 esbench esbench  1438320123 Apr  2 17:07 documents-221998.json
-rw-rw-r--  1 esbench esbench    54274027 Apr  2 17:07 documents-221998.json.bz2
-rw-rw-r--  1 esbench esbench  1202551382 Apr  2 17:07 documents-221998.unparsed.json
-rw-rw-r--  1 esbench esbench    53264421 Apr  2 17:07 documents-221998.unparsed.json.bz2
-rw-rw-r--  1 esbench esbench  1597530673 Apr  2 17:07 documents-231998.json
-rw-rw-r--  1 esbench esbench    61043842 Apr  2 17:07 documents-231998.json.bz2
-rw-rw-r--  1 esbench esbench  1334381144 Apr  2 17:06 documents-231998.unparsed.json
-rw-rw-r--  1 esbench esbench    60795929 Apr  2 17:06 documents-231998.unparsed.json.bz2
-rw-rw-r--  1 esbench esbench 24555905444 Apr  2 17:07 documents-241998.json
-rw-rw-r--  1 esbench esbench   907295259 Apr  2 17:07 documents-241998.json.bz2
-rw-rw-r--  1 esbench esbench 20563705716 Apr  2 17:07 documents-241998.unparsed.json
-rw-rw-r--  1 esbench esbench   899190175 Apr  2 17:07 documents-241998.unparsed.json.bz2

pbzip2-tracks/metricbeat:
total 1310240
drwxrwxr-x  2 esbench esbench       4096 Apr  2 21:01 .
drwxrwxr-x 15 esbench esbench       4096 Apr  2 17:09 ..
-rw-rw-r--  1 esbench esbench 1249705758 Apr  2 17:08 documents.json
-rw-rw-r--  1 esbench esbench   91964149 Apr  2 17:08 documents.json.bz2

pbzip2-tracks/nested:
total 4231756
drwxrwxr-x  2 esbench esbench       4096 Apr  2 20:27 .
drwxrwxr-x 15 esbench esbench       4096 Apr  2 17:09 ..
-rw-rw-r--  1 esbench esbench 3637747670 Apr  2 17:06 documents.json
-rw-rw-r--  1 esbench esbench  695550727 Apr  2 17:06 documents.json.bz2

pbzip2-tracks/noaa:
total 10429456
drwxrwxr-x  2 esbench esbench       4096 Apr  2 20:22 .
drwxrwxr-x 15 esbench esbench       4096 Apr  2 17:09 ..
-rw-rw-r--  1 esbench esbench 9684262698 Apr  2 17:07 documents.json
-rw-rw-r--  1 esbench esbench  995480468 Apr  2 17:07 documents.json.bz2

pbzip2-tracks/nyc_taxis:
total 82639232
drwxrwxr-x  2 esbench esbench        4096 Apr  2 21:02 .
drwxrwxr-x 15 esbench esbench        4096 Apr  2 17:09 ..
-rw-rw-r--  1 esbench esbench 79802445255 Apr  2 17:07 documents.json
-rw-rw-r--  1 esbench esbench  4820107188 Apr  2 17:07 documents.json.bz2

pbzip2-tracks/percolator:
total 107596
drwxrwxr-x  2 esbench esbench      4096 Apr  2 19:51 .
drwxrwxr-x 15 esbench esbench      4096 Apr  2 17:09 ..
-rw-rw-r--  1 esbench esbench 110039748 Apr  2 17:09 queries-2.json
-rw-rw-r--  1 esbench esbench    124009 Apr  2 17:09 queries-2.json.bz2

pbzip2-tracks/pmc:
total 28503708
drwxrwxr-x  2 esbench esbench        4096 Apr  2 20:03 .
drwxrwxr-x 15 esbench esbench        4096 Apr  2 17:09 ..
-rw-rw-r--  1 esbench esbench 23256051757 Apr  2 17:06 documents.json
-rw-rw-r--  1 esbench esbench  5931724449 Apr  2 17:06 documents.json.bz2

pbzip2-tracks/so:
total 44106976
drwxrwxr-x  2 esbench esbench        4096 Apr  2 19:51 .
drwxrwxr-x 15 esbench esbench        4096 Apr  2 17:09 ..
-rw-rw-r--  1 esbench esbench 35564808298 Apr  2 17:08 posts.json
-rw-rw-r--  1 esbench esbench  9600716233 Apr  2 17:08 posts.json.bz2
  2. This PR must not be merged until the new corpora have been uploaded and have replaced the existing ones. Uploading will take place after the nightly jobs (https://elasticsearch-ci.elastic.co/view/All/job/elastic+elasticsearch+master+macrobenchmark-periodic-group-1/ and https://elasticsearch-ci.elastic.co/view/All/job/elastic+elasticsearch+master+macrobenchmark-periodic-group-2) have finished.
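
For reviewers who want to cross-check the `compressed-bytes` values against a directory like the one listed above, here is a minimal Python sketch. It assumes every `*.json.bz2` file sits next to its uncompressed `*.json` source, as in the listing. The function name `corpus_sizes` and the default directory name are illustrative, and the printed entries only mirror the `source-file`, `compressed-bytes` and `uncompressed-bytes` keys rather than a complete track definition.

```python
import json
import os
import sys
from typing import Dict, Iterator, Tuple


def corpus_sizes(root: str) -> Iterator[Tuple[str, Dict[str, object]]]:
    """Yield (path relative to root, size entry) for each .json.bz2 file under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            if not name.endswith(".json.bz2"):
                continue
            compressed = os.path.join(dirpath, name)
            uncompressed = compressed[: -len(".bz2")]
            entry: Dict[str, object] = {
                "source-file": name,
                "compressed-bytes": os.path.getsize(compressed),
            }
            # The uncompressed sibling is optional; report it only if present.
            if os.path.exists(uncompressed):
                entry["uncompressed-bytes"] = os.path.getsize(uncompressed)
            yield os.path.relpath(compressed, root), entry


if __name__ == "__main__":
    # The default directory name is taken from the listing above.
    root = sys.argv[1] if len(sys.argv) > 1 else "pbzip2-tracks"
    for path, entry in corpus_sizes(root):
        print(path, json.dumps(entry))
```

Run against the directory above, the entry for `geonames/documents-2.json.bz2` would report `compressed-bytes` of 265208777 and `uncompressed-bytes` of 3547613828, matching the listing.
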

@danielmitterdorfer left a comment

LGTM

@dliappis merged commit ef1ac55 into elastic:master Apr 6, 2020
dliappis added a commit that referenced this pull request Apr 6, 2020
Update compressed-bytes for all corpora after re-compressing them using
`pbzip2 -9 -v -k -m10000`. Together with elastic/rally#947
this allows for much faster decompression utilizing all available CPU cores.
dliappis added a commit to dliappis/rally-tracks that referenced this pull request Apr 14, 2021
…tic#109)

Update compressed-bytes for all corpora after re-compressing them using
`pbzip2 -9 -v -k -m10000`. Together with elastic/rally#947
this allows for much faster decompression utilizing all available CPU cores.
@dliappis

Backport to 5: #167

dliappis added a commit that referenced this pull request Apr 15, 2021
Update compressed-bytes for all corpora after re-compressing them using
`pbzip2 -9 -v -k -m10000`. Together with elastic/rally#947
this allows for much faster decompression utilizing all available CPU cores.

Backport of #109
Relates #1240