Volume annotation download: zip with BEST_SPEED #6036

Merged: fm3 merged 9 commits into master from zip-partial-compression on Feb 14, 2022

@fm3 fm3 commented Feb 10, 2022

Use Deflater.BEST_SPEED (level 1) instead of level 6 when writing data.zip containing volume buckets and also when adding that file to the outer annotation zip.
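
A minimal sketch of what this change amounts to, assuming the archives are written with `java.util.zip.ZipOutputStream`. The class name, the helper methods, and the entry name `annotation.nml` are hypothetical and only serve the illustration; the actual backend code may be structured differently.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class BestSpeedZipSketch {

    // Zips in-memory entries (relative path -> content) at the given Deflater level.
    static byte[] zip(Map<String, byte[]> entries, int level) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            out.setLevel(level); // used for every DEFLATED entry written afterwards
            for (Map.Entry<String, byte[]> entry : entries.entrySet()) {
                out.putNextEntry(new ZipEntry(entry.getKey()));
                out.write(entry.getValue());
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray(); // complete only after the stream was closed above
    }

    // Builds the inner data.zip and wraps it in the outer annotation zip, both at BEST_SPEED.
    // "annotation.nml" is a hypothetical name for the non-volume part of the download.
    static byte[] annotationZip(Map<String, byte[]> buckets, byte[] nml) {
        Map<String, byte[]> outer = new LinkedHashMap<>();
        outer.put("annotation.nml", nml);
        outer.put("data.zip", zip(buckets, Deflater.BEST_SPEED));
        return zip(outer, Deflater.BEST_SPEED);
    }

    public static void main(String[] args) {
        Map<String, byte[]> buckets = new LinkedHashMap<>();
        buckets.put("bucket-0", new byte[32 * 32 * 32]); // hypothetical path, all-zero payload
        byte[] download = annotationZip(buckets, "<things/>".getBytes(StandardCharsets.UTF_8));
        System.out.println("outer zip: " + download.length + " bytes");
    }
}
```

Passing `Deflater.DEFAULT_COMPRESSION` instead of `Deflater.BEST_SPEED` to `zip` gives the previous behavior for comparison.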

My experiments yielded a few interesting insights:

  • For volume annotation buckets, the zip size differs only slightly between level 1 (BEST_SPEED) and level 6 (DEFAULT), but level 1 gives a ~40% speedup.
  • Level 0 (NO_COMPRESSION) gives an even greater speedup (another 20% compared to level 1, possibly more for smaller annotations) but produces significantly larger zip files (12 times as large in my test case).
  • The file paths make up a significant part of the archive (lz4-compressed buckets are small, but 500k files add up to a large amount of path data).
  • It appears that zip does not compress its file path listing (or at least not very well).
  • This is why compressing the already-compressed data.zip again on the wk side still reduces its size (by about 60% for a large annotation).
  • Note that this varies heavily with the zip implementation. The zip command line tool also offers levels 0 to 9, but I got very different results with it, both in size and timing. I measured with the same Java implementation that our backend uses.
  • See the full experiments below.
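
The per-file overhead behind these observations can be estimated with some back-of-the-envelope arithmetic. The sketch below assumes the fixed record sizes of the ZIP format (30-byte local file header, 46-byte central directory header, 22-byte end-of-central-directory record) and a 16-byte data descriptor per entry, which is an assumption about how a streaming writer lays out entries. Each path is stored twice, uncompressed: once in the local header and once in the central directory. The class and method names are hypothetical; the input numbers are taken from the soma measurement in this description.

```java
class ZipOverheadSketch {
    static final int LOCAL_FILE_HEADER = 30;        // fixed part, followed by the path
    static final int CENTRAL_DIRECTORY_HEADER = 46; // fixed part, followed by the path again
    static final int DATA_DESCRIPTOR = 16;          // assumed: written when sizes are unknown upfront
    static final int END_OF_CENTRAL_DIRECTORY = 22; // once per archive

    // Rough estimate of the uncompressed metadata in an archive with fileCount entries
    // whose paths are pathBytes long on average.
    static long estimatedMetadataBytes(long fileCount, int pathBytes) {
        long perFile = LOCAL_FILE_HEADER + CENTRAL_DIRECTORY_HEADER + DATA_DESCRIPTOR + 2L * pathBytes;
        return fileCount * perFile + END_OF_CENTRAL_DIRECTORY;
    }

    // Overhead per file implied by a level-0 measurement, where everything beyond 100 %
    // of the payload is archive structure (headers, paths, stored-block framing).
    static double measuredOverheadPerFile(long payloadBytes, double level0RatioPercent, long fileCount) {
        return payloadBytes * (level0RatioPercent / 100.0 - 1.0) / fileCount;
    }

    public static void main(String[] args) {
        // Soma annotation: 510501 files, 486,876,075 payload bytes, level 0 ratio 114.3 %
        double perFile = measuredOverheadPerFile(486_876_075L, 114.3, 510_501L);
        System.out.printf("measured overhead: %.1f bytes per file%n", perFile);
        System.out.printf("average bucket size: %.1f bytes%n", 486_876_075.0 / 510_501L);
    }
}
```

For the soma annotation this comes out at roughly 136 bytes of overhead per file against an average bucket size of roughly 954 bytes, which fits the observation that the metadata becomes a large share of the archive once the bucket data itself is deflated.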

URL of deployed dev instance (used for testing):

Steps to test:

  • Create a hybrid annotation and download it
  • The download should still be a valid zip with a valid inner data.zip
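
The check in the last step could be automated along these lines. This is a sketch only: the class and method names are hypothetical, the whole download is assumed to fit in memory, and "valid" is taken to mean that the outer zip contains a `data.zip` entry that itself unzips to at least one entry.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class AnnotationZipCheckSketch {

    // Reads every entry of a zip into memory (name -> content). Reading each entry
    // to its end also surfaces corrupt deflate data as an exception.
    static Map<String, byte[]> unzip(byte[] zipBytes) {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] chunk = new byte[8192];
            for (ZipEntry entry = in.getNextEntry(); entry != null; entry = in.getNextEntry()) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                for (int n = in.read(chunk); n != -1; n = in.read(chunk)) {
                    content.write(chunk, 0, n);
                }
                entries.put(entry.getName(), content.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    // True if the outer zip has a data.zip entry that unzips to at least one entry.
    static boolean hasValidInnerDataZip(byte[] outerZip) {
        try {
            byte[] inner = unzip(outerZip).get("data.zip");
            return inner != null && !unzip(inner).isEmpty();
        } catch (UncheckedIOException e) {
            return false;
        }
    }

    // Helper for the demo below: a zip with exactly one entry.
    static byte[] singleEntryZip(String name, byte[] content) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            out.putNextEntry(new ZipEntry(name));
            out.write(content);
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] inner = singleEntryZip("bucket-0", new byte[] {1, 2, 3}); // hypothetical bucket path
        byte[] outer = singleEntryZip("data.zip", inner);
        System.out.println("valid: " + hasValidInnerDataZip(outer));
    }
}
```

In a manual test, the bytes of the downloaded file would take the place of `outer`.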

Issues:


Measurements

2022-02-10: Measuring the performance of zipping (already lz4-compressed) volume annotation buckets.
All files were held in memory beforehand, as Tuple[Path,Array[Byte]]. Paths are relative. Level -1 means default, which I believe is 6.
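
A measurement loop of this kind could look roughly like the following sketch. It is not the code used for the numbers in this description: the class name and the synthetic low-entropy buckets are placeholders (the real input was lz4-compressed bucket data), so absolute ratios and timings will differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipLevelBenchmarkSketch {

    // Zips in-memory (relative path, content) pairs at the given level; returns the archive size.
    static long zippedSize(List<Map.Entry<String, byte[]>> files, int level) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            out.setLevel(level); // -1 selects the default level
            for (Map.Entry<String, byte[]> file : files) {
                out.putNextEntry(new ZipEntry(file.getKey()));
                out.write(file.getValue());
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    // Archive size as a percentage of the total payload size (paths not counted as payload).
    static double ratioPercent(List<Map.Entry<String, byte[]>> files, int level) {
        long payload = 0;
        for (Map.Entry<String, byte[]> file : files) {
            payload += file.getValue().length;
        }
        return 100.0 * zippedSize(files, level) / payload;
    }

    // Placeholder input: deterministic low-entropy "buckets" with hypothetical paths.
    static List<Map.Entry<String, byte[]>> syntheticFiles(int count, int size) {
        Random random = new Random(42);
        List<Map.Entry<String, byte[]>> files = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            byte[] content = new byte[size];
            for (int j = 0; j < size; j++) {
                content[j] = (byte) random.nextInt(4);
            }
            files.add(Map.entry("bucket-" + i, content));
        }
        return files;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, byte[]>> files = syntheticFiles(1000, 1024);
        for (int level : new int[] {0, 1, 2, -1, 9}) {
            for (int run = 0; run < 4; run++) {
                long start = System.nanoTime();
                zippedSize(files, level);
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("Level %2d, ratio %5.1f %% took %5d ms%n",
                        level, ratioPercent(files, level), ms);
            }
        }
    }
}
```

Repeating each level several times, as in the logs below, helps to spot JIT warm-up and other noise in the timings.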

Huge soma volume annotation, 510501 files, total size (without paths): 486,876,075 bytes
compressing buckets into data.zip:
Level  0, ratio 114.3 % took  3270 ms  # size increase because the file paths are in the archive too
Level  0, ratio 114.3 % took  2651 ms
Level  0, ratio 114.3 % took  2466 ms
Level  0, ratio 114.3 % took  3151 ms
Level  1, ratio  23.8 % took  5927 ms
Level  1, ratio  23.8 % took  5061 ms
Level  1, ratio  23.8 % took  5204 ms
Level  1, ratio  23.8 % took  6942 ms
Level  2, ratio  23.7 % took  6658 ms
Level  2, ratio  23.7 % took  6731 ms
Level  2, ratio  23.7 % took  7533 ms
Level  2, ratio  23.7 % took  7649 ms
Level -1, ratio  22.9 % took  9102 ms
Level -1, ratio  22.9 % took  9034 ms
Level -1, ratio  22.9 % took  8982 ms
Level -1, ratio  22.9 % took  9065 ms
Level  9, ratio  22.9 % took  9990 ms
Level  9, ratio  22.9 % took 10078 ms
Level  9, ratio  22.9 % took 10175 ms
Level  9, ratio  22.9 % took 10075 ms


Smallish dense volume annotation, 688 files, total size (without paths): 4,265,446 bytes
compressing buckets into data.zip:

Level  0, ratio 102.1 % took   6 ms
Level  0, ratio 102.1 % took   7 ms
Level  0, ratio 102.1 % took   6 ms
Level  0, ratio 102.1 % took   6 ms
Level  1, ratio  52.7 % took  79 ms
Level  1, ratio  52.7 % took  75 ms
Level  1, ratio  52.7 % took  76 ms
Level  1, ratio  52.7 % took  75 ms
Level  2, ratio  52.0 % took  85 ms
Level  2, ratio  52.0 % took  82 ms
Level  2, ratio  52.0 % took  81 ms
Level  2, ratio  52.0 % took  83 ms
Level -1, ratio  49.7 % took 147 ms
Level -1, ratio  49.7 % took 147 ms
Level -1, ratio  49.7 % took 148 ms
Level -1, ratio  49.7 % took 147 ms
Level  9, ratio  49.7 % took 152 ms
Level  9, ratio  49.7 % took 151 ms
Level  9, ratio  49.7 % took 148 ms
Level  9, ratio  49.7 % took 148 ms



Compressing the inner data.zip of the soma annotation above (written with level 1, 116 MB, 500k files) again:

Level  0, ratio 100.0 % took  157 ms
Level  0, ratio 100.0 % took  183 ms
Level  0, ratio 100.0 % took  133 ms
Level  0, ratio 100.0 % took  131 ms
Level  1, ratio  38.2 % took 1318 ms
Level  1, ratio  38.2 % took 1253 ms
Level  1, ratio  38.2 % took 1272 ms
Level  1, ratio  38.2 % took 1296 ms
Level  2, ratio  38.1 % took 1367 ms
Level  2, ratio  38.1 % took 1323 ms
Level  2, ratio  38.1 % took 1328 ms
Level  2, ratio  38.1 % took 1309 ms
Level -1, ratio  36.9 % took 2046 ms
Level -1, ratio  36.9 % took 2054 ms
Level -1, ratio  36.9 % took 2056 ms
Level -1, ratio  36.9 % took 2335 ms
Level  9, ratio  36.5 % took 7614 ms
Level  9, ratio  36.5 % took 7050 ms
Level  9, ratio  36.5 % took 6499 ms
Level  9, ratio  36.5 % took 6415 ms

@fm3 fm3 self-assigned this Feb 10, 2022
@fm3 fm3 changed the title from "Zip partial compression" to "Volume annotation download: zip with BEST_SPEED" Feb 10, 2022
@fm3 fm3 marked this pull request as ready for review February 10, 2022 13:28
@fm3 fm3 requested a review from jstriebel February 10, 2022 13:30

@jstriebel jstriebel left a comment

Great findings, LGTM! 👍

> It appears zip does not (very well?) compress its file path listing
> This is why it is possible to compress the already-compressed data.zip file again on the wk-side with actual size reduction (getting 60% reduction for large annotation)

Interesting that this is a bottleneck in our case. This might be an indicator that we want to have larger file-lengths, resulting in fewer shards. But that surely is a different issue.

@fm3 fm3 merged commit d641f7a into master Feb 14, 2022
@fm3 fm3 deleted the zip-partial-compression branch February 14, 2022 11:56
hotzenklotz added a commit that referenced this pull request Feb 18, 2022
…ssos into docs

* 'docs' of github.com:scalableminds/webknossos:

* 'master' of github.com:scalableminds/webknossos:
  Split cells via Min Cut (#5885)
  Clean up backend util package (#6048)
  Guard against empty saves (#6052)
  Time tracking: Do not fail on empty timespans list (#6051)
  Fix clip button changing position (#6050)
  Include ParamFailure values in error chains (#6045)
  Fix non-32-aligned bucket requests (#6047)
  Don't enforce save state when saving is triggered by a timeout and reduce tracing layout analytics event count (#5999)
  Bump cached-path-relative from 1.0.2 to 1.1.0 (#5994)
  Volume annotation download: zip with BEST_SPEED (#6036)
  Sensible scalebar values (#6034)
  Faster CircleCI builds (#6040)
  move to Google Analytics 4 (#6031)
  Fix nightly (fix tokens, upgrade puppeteer) (#6032)
  Add neuron reconstruction job backend and frontend part (#5922)
  Allow uploading multi-layer volume annotations (#6028)
Development

Successfully merging this pull request may close these issues.

Optimize Performance for downloading annotations