Horizon Lite: Come up with better index and meta compression scheme #4497

Closed
Tracked by #4571
2opremio opened this issue Aug 2, 2022 · 2 comments

2opremio commented Aug 2, 2022

Stored ledger metadata and, even more so, indexes occupy a lot of space:

The full metadata files occupy ~8 TB, and we don't currently apply any compression scheme to them.

A preliminary test by @Shaptic of indices built across 100 checkpoints (6,400 ledgers) shows the following:

  • The indices total 1.2 GiB in size
  • Most individual indices were ≤ 250 bytes
  • Compressing the entire set of indices into a single .tar.gz file reduces the size by ~44%. Note that this is different from compressing individual indices (which we already do)

Extrapolating this to a year of history (which comes with some big assumptions, like linear growth of indices with history) gives us ~1 TB of raw indexes: at ~5-second ledger close times, 6,400 ledgers cover roughly 9 hours, so a year is about 985× the sample, and 1.2 GiB × ~985 ≈ 1.15 TiB.

(Details and caveats are captured in this Slack thread. We can update this once a larger build is complete.)


@2opremio predicts we may be able to do much better with zstd, using a common dictionary shared across all files: https://github.com/facebook/zstd#the-case-for-small-data-compression. This would allow us to:

  1. Compress/decompress faster (since the dictionary is precomputed and can be cached)
  2. Increase the compression ratio
  3. I think it would still allow us to make range queries within the compressed data (this was a concern from @bartekn): since the dictionary is stored separately, we can point directly to a zstd frame and an offset within it. See https://datatracker.ietf.org/doc/html/rfc8878#section-3.1

On the other hand, we would need to keep track of a separate compression dictionary, and the compressed files would have to be re-generated whenever we update it (see the sketch below for how the pieces fit together).
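
For illustration, here is a minimal sketch of the scheme, assuming the valyala/gozstd bindings (just one of several Go zstd wrappers; the library choice, dictionary size, and sample data are illustrative assumptions, not part of the proposal): train a shared dictionary once, cache the precomputed CDict/DDict, and reuse them for every index file.

```go
package main

import (
	"fmt"

	"github.com/valyala/gozstd"
)

func main() {
	// Hypothetical training set: many small, structurally similar index
	// files. In practice these would be real index bytes.
	var samples [][]byte
	for i := 0; i < 1000; i++ {
		samples = append(samples,
			[]byte(fmt.Sprintf("index-entry-%d:checkpoint,ledger,activity", i)))
	}

	// Train a shared dictionary once (16 KiB here; the size is a tuning knob).
	dict := gozstd.BuildDict(samples, 16*1024)

	// Precompute the compression/decompression dictionaries so they can be
	// cached and reused for every index file.
	cd, err := gozstd.NewCDict(dict)
	if err != nil {
		panic(err)
	}
	defer cd.Release()

	dd, err := gozstd.NewDDict(dict)
	if err != nil {
		panic(err)
	}
	defer dd.Release()

	// Compress and decompress an individual index with the shared dictionary.
	index := []byte("index-entry-42:checkpoint,ledger,activity")
	compressed := gozstd.CompressDict(nil, index, cd)
	restored, err := gozstd.DecompressDict(nil, compressed, dd)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d -> %d bytes, roundtrip ok: %v\n",
		len(index), len(compressed), string(restored) == string(index))
}
```

Because the CDict/DDict are built and cached once, per-file compression and decompression skip re-parsing the dictionary, which is where the speed win in point 1 would come from.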


Shaptic commented Aug 2, 2022

Training works if there is some correlation in a family of small data samples. The more data-specific a dictionary is, the more efficient it is (there is no universal dictionary). Hence, deploying one dictionary per type of data will provide the greatest benefits.

Per the discussion thread re: dictionary churn, maybe we don't need to train it more than once (or at most occasionally). One training session on a block of history (or possibly less) should be representative of the "account activity" that the indices capture.


As a separate idea, maybe we can fork + modify roaring bitmaps (or sroar) to add the "NextActive" functionality that we need (see the sketch below). Alternatively, we could convert between our format and roaring for on-disk storage, though the conversion may eat up a non-trivial amount of request latency.
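
For reference, a rough sketch of what "NextActive" could look like on top of the stock github.com/RoaringBitmap/roaring iterator API (the function name and semantics are assumptions based on this thread; whether the stock iterator is efficient enough, or a fork is actually needed, is exactly the open question):

```go
package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

// NextActive returns the first ledger >= start in which the account was
// active, i.e. the first set bit at or after start. The bool reports
// whether such a ledger exists.
func NextActive(bm *roaring.Bitmap, start uint32) (uint32, bool) {
	it := bm.Iterator()
	// Skip all set bits strictly below start.
	it.AdvanceIfNeeded(start)
	if it.HasNext() {
		return it.Next(), true
	}
	return 0, false
}

func main() {
	// Hypothetical activity index: the account was active in these ledgers.
	bm := roaring.BitmapOf(100, 6400, 6464)

	if ledger, ok := NextActive(bm, 101); ok {
		fmt.Println("next active ledger:", ledger) // prints 6400
	}
}
```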


bartekn commented Aug 3, 2022

zstd training is super interesting and I really wonder how much it will improve the situation for us. But I think we should maybe start with something big and then iterate in future versions of Horizon. I'm pretty sure SDF will be the only org hosting indexes for some time anyway. This discussion makes me realize how important it is for us to version the meta archives from the initial release.

Shaptic mentioned this issue on Sep 1, 2022
Shaptic changed the title from "Horizon light: come up with better index and meta compression scheme" to "Horizon Lite: Come up with better index and meta compression scheme" on Sep 1, 2022