Benchmark datasets #171
Conversation
Wasn't the initial plan to create a submodule of the real-roaring-dataset repository and to let the user run submodule init/update himself?
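For context, the submodule approach being discussed would amount to an entry like this in `.gitmodules` (the path and URL here are illustrative, not taken from the PR):

```
[submodule "benchmarks/real-roaring-dataset"]
    path = benchmarks/real-roaring-dataset
    url = https://github.com/RoaringBitmap/real-roaring-dataset
```

Each user would then have to run `git submodule update --init` themselves, which is exactly the manual step under debate.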
Correct. This avoids a manual step. Is there some other concern regarding adding these? Would it bloat the size of the published crate?
Yup, but cloning the repository becomes a bit of an issue now: 18 MB of datasets is a lot. Also, I'm not sure that we want to make git care about them. I would prefer that we depend on the official repo, even pinned to a specific revision.
Yup, the other concern is about the size of the repository itself, not the crate, since you put the datasets in the benchmark subcrate. I hope that cargo doesn't publish useless folders!
The difference between this PR and all the zips is rather large: ~18 MB compared to ~95 MB. zstd is ✨ magic ✨
I have verified that it does not. My main concern is admittedly a subjective one: it just feels icky 🤢 to introduce a manual fetch step to run benchmarks.
I am sure that we can find a solution to this by automating the clone. It looks to be that easy:
This uses the zstd crate under the hood; the files are decompressed in the benchmark code.
Valid. I'm going to close this and explore using …
BTW, I don't think that creating a crate to store the benchmarks is a great idea: rust-lang/crates.io#195. The crate size limit is around 10 MB.
186: add runtime dataset fetch and parse in-place r=Kerollmops a=saik0

Closes #129
Closes #171
Closes #185

Here's my go at fetching the datasets at runtime:

* Datasets are lazily fetched the first time they're needed (or updated, if the local `HEAD != origin/master`).
* The zip files are parsed in place on every benchmark run, to keep the on-disk size down.
* The parsing is also lazy, and happens at most once.
* This PR updates any benchmarks that were already using limited data from `wikileaks-noquotes` to use all the datasets.
* A fast-follow PR will update all the benchmarks.

`@Kerollmops` Third time's the charm?

Co-authored-by: saik0 <[email protected]>
Co-authored-by: Joel Pedraza <[email protected]>
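The "lazy, and happens at most once" parsing described above can be sketched with std-only Rust via `OnceLock` (the dataset contents and the `parse_dataset` body are placeholders, not the PR's actual parsing code):

```rust
use std::sync::OnceLock;

// Parsed at most once, no matter how many benchmarks ask for it.
static DATASET: OnceLock<Vec<u32>> = OnceLock::new();

// Hypothetical stand-in for "decompress the zip and parse the bitmap".
fn parse_dataset() -> Vec<u32> {
    println!("parsing dataset (expensive, runs at most once)");
    vec![1, 2, 3]
}

fn dataset() -> &'static Vec<u32> {
    // get_or_init runs the closure only on the first call;
    // later calls return the cached value.
    DATASET.get_or_init(parse_dataset)
}

fn main() {
    let first = dataset();
    let second = dataset(); // cached; parse_dataset is not called again
    assert!(std::ptr::eq(first, second));
}
```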
Closes #129
Adds the CRoaring benchmark datasets. File contents are zstd-compressed serialized bitmaps using a shared dictionary. Altogether this adds about 18 MiB.
Utilizing the datasets is out of scope for this PR.