brainstorming: alternative signature storage/loading/query formats #1262
Yup, I agree.
I would really like to avoid protobuf (e.g. https://twitter.com/fasterthanlime/status/1340944948582113282). On the Rust side, serde supports a bunch of formats, but performance-wise it would be better to have something that doesn't require encoding/decoding on use (zero-copy deserialization, like Cap'n Proto, which mash also uses, or rkyv, which is Rust-only), even though that is not as flexible as JSON... (Tree-buf looks REALLY interesting, but still doesn't support other languages.)
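For concreteness, here's a minimal sketch of what the zero-copy option could look like with rkyv's 0.7-era derive API - the `Sketch` struct, its fields, and the buffer sizes are made up for illustration, not sourmash's actual layout:

```rust
// Minimal rkyv zero-copy round-trip.
// Cargo.toml (assumed): rkyv = { version = "0.7", features = ["validation"] }
use rkyv::{Archive, Deserialize, Serialize};

// Hypothetical stand-in for a scaled MinHash sketch.
#[derive(Archive, Serialize, Deserialize, Debug)]
#[archive(check_bytes)]
struct Sketch {
    ksize: u32,
    scaled: u64,
    hashes: Vec<u64>,
}

fn main() {
    let sketch = Sketch { ksize: 31, scaled: 1000, hashes: vec![42, 1001, 9999] };

    // Serialize into an aligned byte buffer.
    let bytes = rkyv::to_bytes::<_, 256>(&sketch).expect("serialize");

    // "Loading" is a validated pointer cast: the hash list is read
    // directly out of `bytes`, with no decode step and no copy.
    let archived = rkyv::check_archived_root::<Sketch>(&bytes).expect("validate");
    assert_eq!(archived.hashes[1], 1001);
}
```

The flip side, as noted above, is that the on-disk layout is Rust-specific.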
Mixed feelings. I think it is a good idea when compared to using Zip files for databases, but I'm not so sure about single signatures... Relevant read: https://www.sqlite.org/affcase1.html
what about Avro? https://avro.apache.org/
This is probably very easy to test, considering that https://github.com/flavray/avro-rs supports serde. I was looking more into the Arrow/Parquet direction, which would also make it easier to work with more data-analysis-like workflows (loading into pandas, and so on).

Another direction to consider: in #1221 I was using the BitMagic serialization/deserialization for saving nodegraphs, but it might also be a good representation for scaled MinHash sketches (save a "compressed bitmap" of the hashes instead of a list). BitMagic is not a good portable format, but I wonder if any of the options mentioned here support something along the lines of the bitmap idea.
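A hedged sketch of what that avro-rs test could look like - the schema and `Sketch` record here are invented for illustration, and note that Avro's `long` is signed, so real u64 hash values would need a mapping:

```rust
// Quick avro-rs round-trip through serde.
// Cargo.toml (assumed): avro-rs = "0.13", serde = { version = "1", features = ["derive"] }
// (the crate later moved to apache-avro, with small API differences).
use avro_rs::{from_value, Reader, Schema, Writer};
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
struct Sketch {
    ksize: i32,
    hashes: Vec<i64>, // Avro has no unsigned 64-bit type
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Schema::parse_str(
        r#"{
            "type": "record",
            "name": "Sketch",
            "fields": [
                {"name": "ksize", "type": "int"},
                {"name": "hashes", "type": {"type": "array", "items": "long"}}
            ]
        }"#,
    )?;

    // Write one record via serde.
    let mut writer = Writer::new(&schema, Vec::new());
    writer.append_ser(Sketch { ksize: 31, hashes: vec![42, 1001, 9999] })?;
    writer.flush()?;
    let encoded = writer.into_inner();

    // Read it back; the schema travels embedded in the container file.
    for value in Reader::new(&encoded[..])? {
        let sketch: Sketch = from_value(&value?)?;
        println!("{:?}", sketch);
    }
    Ok(())
}
```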
I started playing with the easy ones (the formats supported by serde) in https://github.com/luizirber/2021-02-11-sourmash-binary-format; will report back when I have more results.
thoughts stemming from all the manifest work that has happened: between the recent introduction of […], there's also the idea of storing sketches in kProcessor kDataFrames or other k-mer-specialized formats.
side note: it would be neat to find ways of avoiding even reading or adding hashes (e.g. store them in bands, #1578, or hierarchically at different scaled values).
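To make the banding idea concrete, here's a hypothetical sketch (not what #1578 or sourmash actually implements) that partitions hashes by the coarsest scaled level that keeps them, assuming the usual scaled keep-rule that a hash h is kept when h <= u64::MAX / scaled:

```rust
fn threshold(scaled: u64) -> u64 {
    u64::MAX / scaled
}

/// Partition hashes into bands, one per scaled level, coarsest level
/// (largest scaled, smallest threshold) first. Band 0 holds hashes kept
/// at every level; band i holds hashes that first appear at levels[i].
/// Loading a sketch at levels[j] then means reading bands 0..=j only.
fn band_by_scaled(hashes: &[u64], levels: &[u64]) -> Vec<Vec<u64>> {
    let mut bands = vec![Vec::new(); levels.len()];
    for &h in hashes {
        for (i, &scaled) in levels.iter().enumerate() {
            if h <= threshold(scaled) {
                bands[i].push(h); // coarsest band that keeps h
                break;
            }
        }
    }
    bands
}

fn main() {
    let levels = [100_000u64, 10_000, 1_000]; // coarsest first (made-up values)
    // Pseudo-random u64 "hashes" for demonstration purposes.
    let hashes: Vec<u64> = (1..=1_000_000u64)
        .map(|i| i.wrapping_mul(0x9E37_79B9_7F4A_7C15))
        .collect();

    let bands = band_by_scaled(&hashes, &levels);
    let sizes: Vec<usize> = bands.iter().map(|b| b.len()).collect();
    println!("band sizes: {:?}", sizes);
    println!("hashes needed at scaled=10000: {}", sizes[0] + sizes[1]);
}
```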
briefly looked into Roaring Bitmaps, https://roaringbitmap.org/about/, which has both Rust and Python bindings. however, while the roaring library and roaring-rs both seem to support 64-bit numbers, pyroaring does not yet - Ezibenroc/PyRoaringBitMap#58

update - also see https://pypi.org/project/roroaring64/, which supports deserialization but not serialization; https://pypi.org/project/pilosa-roaring/, which primarily (only?) supports serialization and deserialization - not clear if it supports 64 bits; and https://github.com/sunzhaoping/python-croaring/, which is a cffi wrapper but does not support 64 bits.
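FWIW, on the Rust side the 64-bit type is `RoaringTreemap` (a tree of 32-bit roaring bitmaps), and it round-trips through its own (de)serialization methods; a small sketch, assuming the `roaring` crate:

```rust
// Cargo.toml (assumed): roaring = "0.10"
use roaring::RoaringTreemap;

fn main() {
    // RoaringTreemap is the 64-bit variant, which matters here because
    // sourmash hash values are u64 and routinely exceed u32::MAX.
    let mut hashes = RoaringTreemap::new();
    hashes.insert(42);
    hashes.insert(9_876_543_210); // > u32::MAX

    // Serialize to bytes and back.
    let mut bytes = Vec::new();
    hashes.serialize_into(&mut bytes).unwrap();
    let restored = RoaringTreemap::deserialize_from(&bytes[..]).unwrap();
    assert_eq!(hashes, restored);
}
```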
I'll do a quick check on the Rust one for mastiff - I really liked the API! At the moment #2230 is using rkyv to serialize/deserialize the list of datasets containing a hash, and while that process is fast, it is using a regular […]
Seems like roaring is smaller and faster than rkyv in a first test; will try more extensive benchmarks soon.
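For anyone wanting to reproduce a rough version of that comparison, a toy size-only sketch along these lines (the ID set is a made-up stand-in for "datasets containing a hash"):

```rust
// Toy size comparison: the same ID set as an rkyv'd Vec<u64> vs a
// serialized RoaringTreemap. Assumes rkyv = "0.7", roaring = "0.10".
use roaring::RoaringTreemap;

fn main() {
    // Dense, regularly spaced IDs; uniformly random u64 hashes would
    // compress far less well as bitmap runs and narrow the gap.
    let ids: Vec<u64> = (0..100_000u64).map(|i| i * 3).collect();

    let rkyv_bytes = rkyv::to_bytes::<_, 256>(&ids).unwrap();

    let treemap: RoaringTreemap = ids.iter().copied().collect();
    let mut roaring_bytes = Vec::new();
    treemap.serialize_into(&mut roaring_bytes).unwrap();

    println!("rkyv Vec<u64>:  {} bytes", rkyv_bytes.len());
    println!("RoaringTreemap: {} bytes", roaring_bytes.len());
}
```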
Caveat on […]
flatbuffers: https://google.github.io/flatbuffers/
I thought this was a nice 'splainer about parquet vs avro - https://stackoverflow.com/questions/28957291/avro-vs-parquet - tl;dr avro is row-based, like CSV (but with more complicated rows), while parquet is column-based. I think that means parquet would be a better choice for manifests, where you might want to select only a few specific columns, while avro is essentially a replacement for JSON in the way we do things internally.
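To illustrate the column-pruning point with the Rust arrow/parquet crates - the manifest columns here are invented for the example, not sourmash's actual manifest schema:

```rust
// Write a tiny made-up manifest to parquet, then read back only two of
// its three columns. Assumes the arrow and parquet crates (arrow-rs).
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray, UInt32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::arrow_writer::ArrowWriter;
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One row per signature.
    let schema = Arc::new(Schema::new(vec![
        Field::new("name", DataType::Utf8, false),
        Field::new("ksize", DataType::UInt32, false),
        Field::new("md5", DataType::Utf8, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(StringArray::from(vec!["sig1", "sig2"])) as ArrayRef,
            Arc::new(UInt32Array::from(vec![21, 31])) as ArrayRef,
            Arc::new(StringArray::from(vec!["d41d8...", "e99a1..."])) as ArrayRef,
        ],
    )?;

    let mut writer = ArrowWriter::try_new(File::create("manifest.parquet")?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // Read back only `name` and `ksize`; the md5 column is never decoded,
    // which is the column-pruning win over a row-based format like avro.
    let builder = ParquetRecordBatchReaderBuilder::try_new(File::open("manifest.parquet")?)?;
    let projection = ProjectionMask::roots(builder.parquet_schema(), [0, 1]);
    for batch in builder.with_projection(projection).build()? {
        println!("{:?}", batch?);
    }
    Ok(())
}
```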
maybe relevant? pandas 2.0 and the Arrow revolution (part I) - Marc Garcia

Pandas 2.0 is going to have an Apache Arrow backend for data. This is eventually going to be a pretty big deal for large or complex data analyses - and not just because it'll be faster and will have better data-type and missing-value handling. It will also mean the in-memory data representation is compatible with (and can be used in place by) a wide range of other tools: databases (DuckDB), analysis and plotting tools, file-handling tools... Garcia goes much deeper into this.
from #1226 (comment), @luizirber says: […]

a couple of thoughts here - […]