Benchmark datasets #171

saik0 · 2022-01-31T19:12:41Z

Closes #129

Adds the CRoaring benchmark datasets. The files are serialized bitmaps, zstd-compressed with a shared dictionary. Altogether this adds about 18 MiB.

Utilizing the datasets is out of scope for this PR.

Kerollmops

Wasn't the initial plan to create a submodule of the real-roaring-dataset repository and to let the user submodule init/update by himself?

saik0 · 2022-02-01T17:32:34Z

> Wasn't the initial plan to create a submodule of the real-roaring-dataset repository and to let the user submodule init/update by himself?

Correct. This avoids a manual step.

Is there some other concern regarding adding these? Would it bloat the size of the published crate?

Kerollmops · 2022-02-01T17:48:10Z

> Correct. This avoids a manual step.

Yup, but that makes cloning the repository a bit of an issue: 18 MB of datasets is a lot. Also, I'm not sure that we want git to track them. I would prefer that we depend on the official repo, even pinned to a specific revision.

> Is there some other concern regarding adding these? Would it bloat the size of the published crate?

Yup, the other concern is about the size of the repository itself, not the crate, since you put the datasets in the benchmark subcrate. I hope that cargo doesn't push useless folders!

saik0 · 2022-02-01T18:17:24Z

> Yup, the other concern is about the size of the repository itself

The difference between this PR and all the zips is rather large: ~18 MB compared to ~95 MB. zstd is ✨ magic ✨
This size reduction is what led me to just include them in the benchmark crate.

> I hope that cargo doesn't push useless folders!

I have verified that it does not.

My main concern is admittedly a subjective one. It just feels icky 🤢 to introduce a manual fetch step to run benchmarks.

Kerollmops · 2022-02-01T20:35:19Z

> My main concern is admittedly a subjective one. It just feels icky 🤢 to introduce a manual fetch step to run benchmarks.

Maybe we can run git submodule init/update ourselves in the build.rs. I just checked the size of the repository: not even compressed, it's 628K, so adding 18M to that is a lot. The other downside I see is that the user needs to install zstd to be able to run the benchmarks.

I am sure that we can find a solution to this by automating the clone in the build.rs script or something. It could even just be a call to the git submodule init/update Command. What do you think?

It looks to be that easy:

https://github.com/rust-lang/git2-rs/blob/c55bd6dbdba52f90788150180ef124ef6c90daa6/libgit2-sys/build.rs#L27-L31
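For illustration, here is a minimal sketch of what such a build-script step could look like, using only `std::process::Command`. The function names and the submodule path passed in are hypothetical, not something this PR defines, and the check for a `.git` entry is an assumption about how to detect an already-populated submodule.

```rust
use std::path::Path;
use std::process::Command;

/// Builds (but does not run) `git submodule update --init <submodule>`.
/// `submodule` is a path relative to the repository root (hypothetical here).
fn submodule_update_command(submodule: &str) -> Command {
    let mut cmd = Command::new("git");
    cmd.args(["submodule", "update", "--init", submodule]);
    cmd
}

/// Initializes the submodule only when its checkout looks missing.
/// Returns Ok(true) if git was invoked and succeeded,
/// Ok(false) if the checkout was already present.
fn ensure_submodule(repo_root: &Path, submodule: &str) -> std::io::Result<bool> {
    // Assumption: a populated submodule has a `.git` entry at its root.
    if repo_root.join(submodule).join(".git").exists() {
        return Ok(false);
    }
    let status = submodule_update_command(submodule)
        .current_dir(repo_root)
        .status()?;
    if status.success() {
        Ok(true)
    } else {
        Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!("git submodule update failed: {status}"),
        ))
    }
}
```

A `build.rs` could call `ensure_submodule` from its `main` with the crate root and the submodule path. Note that this still requires a `git` binary on the user's `PATH`, and it only works inside a git checkout, not from a published crate tarball.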

saik0 · 2022-02-01T23:51:45Z

> The other downside I see is that the user needs to install zstd to be able to run the benchmarks.

This uses the zstd crate under the hood; the files are decompressed in the benchmark code.

> I just checked the size of the repository, not even compressed, it's 628K, adding 18M to that is a lot.

Valid.

I'm going to close this and explore using the git binary through Command, and also check whether there's some other magic cargo can do. It might make sense to have a crate of datasets.

Kerollmops · 2022-02-08T10:59:43Z

> I'm going to close this and explore using git binary through Command and also check to see if there's some other magic cargo can do. It might make sense to have a crate of datasets.

BTW, I don't think that creating a crate to store the benchmarks is a great idea: rust-lang/crates.io#195. The crate size limit is around 10 MB.

186: add runtime dataset fetch and parse in-place r=Kerollmops a=saik0

Closes #129
Closes #171
Closes #185

Here's my go at fetching the datasets at runtime

* Datasets are lazily fetched the first time they're needed (or updated, if local `HEAD != origin/master`).
* The zip files are parsed in place on every benchmark run, to keep the on-disk size down.
* The parsing is also lazy, and happens at most once.
* This PR updates any benchmarks that were already using limited data from `wikileaks-noquotes` to use all the datasets.
* A fast follow PR will update all the benchmarks.

`@Kerollmops` Third time's the charm?

Co-authored-by: saik0 <[email protected]>
Co-authored-by: Joel Pedraza <[email protected]>
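The "lazy, at most once" parsing described in that commit message can be sketched with `std::sync::OnceLock`. This is only an illustration of the pattern, not the code merged in #186: the `Dataset` type, the `parse_all` function, and its placeholder contents are hypothetical, and the real implementation reads zip files rather than building values in memory.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for one parsed dataset.
struct Dataset {
    name: String,
    bitmaps: Vec<Vec<u32>>,
}

/// Counts how many times the expensive parse actually ran.
static PARSE_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Process-wide cache, filled on first access.
static DATASETS: OnceLock<Vec<Dataset>> = OnceLock::new();

/// Placeholder for the expensive step (fetching and parsing the archives).
fn parse_all() -> Vec<Dataset> {
    PARSE_COUNT.fetch_add(1, Ordering::SeqCst);
    vec![Dataset {
        name: "wikileaks-noquotes".to_string(),
        bitmaps: vec![vec![1, 2, 3], vec![2, 3, 5, 8]],
    }]
}

/// Returns the parsed datasets, parsing them on the first call only.
fn datasets() -> &'static [Dataset] {
    DATASETS.get_or_init(parse_all)
}
```

Every benchmark that calls `datasets()` shares the same parsed data, and a benchmark run that never touches the datasets pays nothing for them.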


saik0 added 2 commits January 31, 2022 10:53

add all benchmark datasets

8745abb

add dataset readme

f639cda

Kerollmops requested changes Feb 1, 2022


saik0 closed this Feb 1, 2022

saik0 mentioned this pull request Feb 6, 2022

Feature: Add select #135

Closed

3 tasks

Kerollmops mentioned this pull request Feb 8, 2022

Add a submodule of a set of real datasets for benchmarks #185

Closed

saik0 mentioned this pull request Feb 9, 2022

add runtime dataset fetch and parse in-place #186

Merged
