Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] interleaved_bloom_filter::clear #2428

Merged
merged 1 commit into from
Mar 8, 2021

Conversation

eseiler
Copy link
Member

@eseiler eseiler commented Mar 4, 2021

@vercel
Copy link

vercel bot commented Mar 4, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/seqan/seqan3/HaWHXnUHX5FSQy9vaoSsHANy3FG5
✅ Preview: https://seqan3-git-fork-eseiler-feature-ibfclear-seqan.vercel.app

@eseiler eseiler force-pushed the feature/ibf_clear branch 4 times, most recently from ca43998 to ca49b11 Compare March 4, 2021 16:30
@eseiler eseiler requested a review from MitraDarja March 4, 2021 16:34
@codecov
Copy link

codecov bot commented Mar 4, 2021

Codecov Report

Merging #2428 (f5d4acd) into master (08bab5d) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2428   +/-   ##
=======================================
  Coverage   98.24%   98.24%           
=======================================
  Files         267      267           
  Lines       10917    10929   +12     
=======================================
+ Hits        10725    10737   +12     
  Misses        192      192           
Impacted Files Coverage Δ
...n3/search/dream_index/interleaved_bloom_filter.hpp 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 08bab5d...5a54705. Read the comment docs.

Copy link
Contributor

@MitraDarja MitraDarja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that is a really useful function. I just have two questions, the rest looks good to me. :)

@eseiler eseiler requested a review from MitraDarja March 4, 2021 17:58
Copy link
Contributor

@MitraDarja MitraDarja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining. :)

@eseiler eseiler requested a review from marehr March 4, 2021 18:29
@eseiler eseiler force-pushed the feature/ibf_clear branch from 3008624 to 58a3f8e Compare March 5, 2021 11:37
@eseiler eseiler force-pushed the feature/ibf_clear branch from 58a3f8e to 5a54705 Compare March 5, 2021 11:38
//!\cond
requires (data_layout_mode == data_layout::uncompressed)
//!\endcond
void clear(rng_t && bin_range) noexcept
Copy link
Member

@marehr marehr Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that this function only exists in the case that bins are close to each other and clearing them in bulk would be more cache efficient?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that would assume sorting on the bins. I guess you collect all bins to be cleared in a vector. Do you collect them in-order?

Copy link
Member Author

@eseiler eseiler Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that this function only exists in the case that bins are close to each other and clearing them in bulk would be more cache efficient?

Yes, has a similar effect as having a range overload for emplace.
I did some benchmarks, and you have some cache effects, the more bins you have and the bigger the IBF, the better the speedup (inversely, for small IBFs (a few MiB) and low bin count (<256), the non-range overload is faster). I did not look into how many bins you need to clear at once.
I saw x10 for 4 GiB and 1024 bins for clearing all bins.

But that would assume sorting on the bins. I guess you collect all bins to be cleared in a vector. Do you collect them in-order?

I kinda assume sorting - I don't know how the bins are collected in an application.
I could:

  • Sort the range (I would then require std::ranges::random_access_range)
    • If I do this, I might as well make the range unique?
  • Document that it would be good if the range was sorted (and unique).
  • Remove the overload because it is improbable that many close bins are cleared at the same time.
  • Leave it there, but hide it in the documentation until we tested more.

Copy link
Member

@marehr marehr Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, has a similar effect as having a range overload for emplace.

We don't have one right?

I kinda assume sorting - I don't know how the bins are collected in an application. I could:

I'm torn because I know that this can have a performance impact. The main problem is that the data layout is in stripes and I would assume that most bins (~1024) fit in one cache line and even "few" bins should see some benefit from that even if not sorted (since they fall in the same cache line).

Form an interface perspective a range "for a" clear is somehow strange.

What I'm currently thinking is whether we can use the "binning_bitvector" to clear the bins?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have one right?

Not yet, but we should add it soon.

Form an interface perspective a range "for a" clear is somehow strange.

erase would probably be a better name; however erase usually also does not have a range interface

What I'm currently thinking is whether we can use the "binning_bitvector" to clear the bins?

How do you mean exactly?
But now that you mention it, it would probably be better to maybe manipulate words instead of bits.
I.e. erasing bins [1,2,3,63] would be one "word access", but 4 "bit accesses".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you mean exactly?

I was asking myself: "When do you clear?" Probably because of a "contain" query. So can't we use the same data type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to erase a bin as part of an update procedure.
If you build an IBF over a database and, say, genome_1 is in bin_1, but then in an update genome_1 gets deleted, you will want to erase bin_1.

So basically, you need erase to do remove and edit of bins affected by some database update.

@eseiler eseiler requested a review from marehr March 5, 2021 15:06
@eseiler eseiler merged commit 696b6e3 into seqan:master Mar 8, 2021
@eseiler eseiler deleted the feature/ibf_clear branch May 14, 2021 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

clear() for the interleaved_bloom_filter
3 participants