-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE] interleaved_bloom_filter::clear #2428
Conversation
This pull request is being automatically deployed with Vercel (learn more). 🔍 Inspect: https://vercel.com/seqan/seqan3/HaWHXnUHX5FSQy9vaoSsHANy3FG5 |
ca43998
to
ca49b11
Compare
Codecov Report
@@ Coverage Diff @@
## master #2428 +/- ##
=======================================
Coverage 98.24% 98.24%
=======================================
Files 267 267
Lines 10917 10929 +12
=======================================
+ Hits 10725 10737 +12
Misses 192 192
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that is a really useful function. I just have two questions, the rest looks good to me. :)
test/performance/search/dream_index/interleaved_bloom_filter_benchmark.cpp
Show resolved
Hide resolved
test/performance/search/dream_index/interleaved_bloom_filter_benchmark.cpp
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining. :)
3008624
to
58a3f8e
Compare
58a3f8e
to
5a54705
Compare
//!\cond | ||
requires (data_layout_mode == data_layout::uncompressed) | ||
//!\endcond | ||
void clear(rng_t && bin_range) noexcept |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that this function only exists in the case that bins are close to each other and clearing them in bulk would be more cache efficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that would assume sorting on the bins. I guess you collect all bins to be cleared in a vector. Do you collect them in-order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that this function only exists in the case that bins are close to each other and clearing them in bulk would be more cache efficient?
Yes, has a similar effect as having a range overload for emplace
.
I did some benchmarks, and you have some cache effects, the more bins you have and the bigger the IBF, the better the speedup (inversely, for small IBFs (a few MiB) and low bin count (<256), the non-range overload is faster). I did not look into how many bins you need to clear at once.
I saw x10 for 4 GiB and 1024 bins for clearing all bins.
But that would assume sorting on the bins. I guess you collect all bins to be cleared in a vector. Do you collect them in-order?
I kinda assume sorting - I don't know how the bins are collected in an application.
I could:
- Sort the range (I would then require
std::ranges::random_access_range
)- If I do this, I might as well make the range unique?
- Document that it would be good if the range was sorted (and unique).
- Remove the overload because it is improbable that many close bins are cleared at the same time.
- Leave it there, but hide it in the documentation until we tested more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, has a similar effect as having a range overload for
emplace
.
We don't have one right?
I kinda assume sorting - I don't know how the bins are collected in an application. I could:
I'm torn because I know that this can have a performance impact. The main problem is that the data layout is in stripes and I would assume that most bins (~1024) fit in one cache line and even "few" bins should see some benefit from that even if not sorted (since they fall in the same cache line).
Form an interface perspective a range "for a" clear is somehow strange.
What I'm currently thinking is whether we can use the "binning_bitvector" to clear the bins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have one right?
Not yet, but we should add it soon.
Form an interface perspective a range "for a" clear is somehow strange.
erase
would probably be a better name; however erase
usually also does not have a range interface
What I'm currently thinking is whether we can use the "binning_bitvector" to clear the bins?
How do you mean exactly?
But now that you mention it, it would probably be better to maybe manipulate words instead of bits.
I.e. erasing bins [1,2,3,63] would be one "word access", but 4 "bit accesses".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you mean exactly?
I was asking myself: "When do you clear?" Probably because of a "contain" query. So can't we use the same data type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You want to erase a bin as part of an update procedure.
If you build an IBF over a database and, say, genome_1
is in bin_1
, but then in an update genome_1
gets deleted, you will want to erase bin_1
.
So basically, you need erase
to do remove and edit of bins affected by some database update.
resolves seqan/product_backlog#296