
inline naive_kmer_hash and naive_miniser_hash #2967

Merged (2 commits) on May 3, 2022

Conversation

@SGSSGene (Contributor) commented Apr 25, 2022

The actual goal is to remove the usage of ranges::views::sliding. This
is most easily achieved by removing naive_kmer_hash and naive_minimizer_hash entirely.

This is part of seqan/product_backlog#124

-A life Without Range-v3-

@vercel vercel bot commented Apr 25, 2022

seqan3 deployment: ✅ Ready (Preview), updated Apr 26, 2022 at 1:07PM (UTC)

@SGSSGene SGSSGene requested review from a team and feldroop and removed request for a team April 25, 2022 18:11
@codecov codecov bot commented Apr 25, 2022

Codecov Report

Merging #2967 (1763178) into master (b4984bc) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #2967   +/-   ##
=======================================
  Coverage   98.22%   98.22%           
=======================================
  Files         267      267           
  Lines       11511    11511           
=======================================
  Hits        11307    11307           
  Misses        204      204           

Last update e5c63f9...1763178.

@smehringer (Member) commented Apr 26, 2022

How do the benchmark timings change with this patch?

@SGSSGene (Contributor, Author)

The benchmarks are very similar. For some reason there is more change in seqan_kmer_hash_ungapped (which I didn't change) than in naive_kmer_hash. Any ideas why seqan_kmer_hash_ungapped changed so much?

Current version:

--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
seqan_kmer_hash_ungapped/1000/8          830 ns          830 ns       759898 Throughput[bp/s]=1.19651G/s
seqan_kmer_hash_ungapped/1000/30         728 ns          728 ns       948861 Throughput[bp/s]=1.33417G/s
seqan_kmer_hash_ungapped/50000/8       39925 ns        39925 ns        17327 Throughput[bp/s]=1.25218G/s
seqan_kmer_hash_ungapped/50000/30      36520 ns        36520 ns        19142 Throughput[bp/s]=1.36831G/s
seqan_kmer_hash_gapped/1000/8           5955 ns         5955 ns       116635 Throughput[bp/s]=166.745M/s
seqan_kmer_hash_gapped/1000/30         21967 ns        21967 ns        31821 Throughput[bp/s]=44.2024M/s
seqan_kmer_hash_gapped/50000/8        296380 ns       296381 ns         2350 Throughput[bp/s]=168.678M/s
seqan_kmer_hash_gapped/50000/30      1137668 ns      1137671 ns          616 Throughput[bp/s]=43.924M/s
naive_kmer_hash/1000/8                  2213 ns         2213 ns       316473 Throughput[bp/s]=448.798M/s
naive_kmer_hash/1000/30                 7338 ns         7338 ns        94570 Throughput[bp/s]=132.32M/s
naive_kmer_hash/50000/8               110636 ns       110635 ns         6308 Throughput[bp/s]=451.871M/s
naive_kmer_hash/50000/30              377885 ns       377886 ns         1853 Throughput[bp/s]=132.238M/s

PR version

--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
seqan_kmer_hash_ungapped/1000/8          825 ns          825 ns       768144 Throughput[bp/s]=1.20313G/s
seqan_kmer_hash_ungapped/1000/30         879 ns          879 ns       790221 Throughput[bp/s]=1.10443G/s
seqan_kmer_hash_ungapped/50000/8       38567 ns        38567 ns        18220 Throughput[bp/s]=1.29627G/s
seqan_kmer_hash_ungapped/50000/30      41032 ns        41030 ns        17301 Throughput[bp/s]=1.2179G/s
seqan_kmer_hash_gapped/1000/8           6515 ns         6515 ns       106547 Throughput[bp/s]=152.407M/s
seqan_kmer_hash_gapped/1000/30         22329 ns        22329 ns        31242 Throughput[bp/s]=43.4863M/s
seqan_kmer_hash_gapped/50000/8        329464 ns       329465 ns         2120 Throughput[bp/s]=151.74M/s
seqan_kmer_hash_gapped/50000/30      1146840 ns      1146842 ns          610 Throughput[bp/s]=43.5727M/s
naive_kmer_hash/1000/8                  2163 ns         2163 ns       323162 Throughput[bp/s]=459.045M/s
naive_kmer_hash/1000/30                 7057 ns         7057 ns        97861 Throughput[bp/s]=137.595M/s
naive_kmer_hash/50000/8               107768 ns       107768 ns         6404 Throughput[bp/s]=463.894M/s
naive_kmer_hash/50000/30              363621 ns       363622 ns         1905 Throughput[bp/s]=137.426M/s

@eseiler (Member) commented Apr 26, 2022

There doesn't seem to be too much change?
You could increase the sizes (e.g. 1000 -> 100'000 and 50'000 -> 5'000'000) to try and see if this avoids cache artifacts.

We also do

# Force alignment of benchmarked loops so that numbers are reliable.
# For large loops and erratic seeming bench results the value might
# have to be adapted or the option deactivated.
option (SEQAN3_BENCHMARK_ALIGN_LOOPS "Pass -falign-loops=32 to the benchmark builds." ON)

The value could also be 64, but I think increasing the benchmarked sequence sizes is enough.

In general, microbenchmarks are quite finicky and might even change when recompiling :)

google benchmark also offers some options, here is an example I used for I/O:

./io/format_fasta_benchmark --benchmark_enable_random_interleaving=true \
                            --benchmark_repetitions=100 \
                            --benchmark_min_time=0.1 \
                            --benchmark_display_aggregates_only=yes \
                        | grep --color=auto mean | sort

100 repetitions is a bit much; 5 should be fine.

benchmark_enable_random_interleaving is quite nice as it randomizes the order in which the benchmarks are executed.

Comment on lines 108 to 110
auto subRangeEnd = subRangeBegin;
for (size_t i{1}; i < k; ++i && subRangeEnd != end(seq))
++subRangeEnd;
Member
Suggested change
auto subRangeEnd = subRangeBegin;
for (size_t i{1}; i < k; ++i && subRangeEnd != end(seq))
++subRangeEnd;
auto subRangeEnd = std::ranges::next(subRangeBegin, k - 1, end(seq));

Should be equivalent. Copies subRangeBegin, increments it k-1 times, and uses end as boundary.
The only thing I'm not sure about is the k-1 :D

Contributor (Author)
Awesome, much nicer solution!

@@ -104,8 +103,22 @@ static void naive_kmer_hash(benchmark::State & state)

for (auto _ : state)
{
for (auto h : seq | seqan3::views::naive_kmer_hash(k))
// creates an initial subrange covering exactly 'k-1' characters
auto subRangeBegin = begin(seq);
Member
naming should be subrange_begin, subrange_end, I think

Contributor (Author)
Yes! much better!

The actual goal is to remove the usage of ranges::views::sliding. This
is most easily achieved by removing naive_kmer_hash_fn entirely.

-A life Without Range-v3-
@SGSSGene SGSSGene force-pushed the remove/range/sliding branch from c90e78b to 6772123 on April 26, 2022 11:29
@SGSSGene SGSSGene requested a review from eseiler April 26, 2022 11:30
@feldroop (Member) left a comment

I am currently mainly working on the FONDA project, but I have time, so I am happy to review some PRs.
You will have to live with some possibly uninformed questions. ;)

The way I understand it is this:
We want to get rid of range-v3
-> we want to get rid of ranges::views::sliding
-> the only place where it is used is in the naive_kmer_hash_fn
-> the naive_kmer_hash_fn is only used in a single benchmark, so we reimplement its use there and delete it.

My questions:

  • Is this really the only place where ranges::views::sliding is used in SeqAn3? I have no idea, but it seems to me like this functionality might be actually used a lot. If that's the case, wouldn't it be better to reimplement the sliding view and leave everything here as is?
  • Are we really sure that we will never need the naive kmer hash function for any other benchmarks? We could also keep the naive_kmer_hash_fn and replace the sliding view there with our implementation.

The overall goal is to remove the usage of ranges::views::sliding.
This commit removes naive_minimiser_hash
@SGSSGene (Contributor, Author)

There was a second place where ranges::views::sliding was used. I updated this PR to also remove it from there.

Yes, you understand it correctly. This is the only place where ranges::views::sliding is being used (and the other one that I just fixed ;-)). We want to remove range-v3 sooner rather than later. Currently, no other functionality is using the sliding view.

This PR also removes only the "naive" approaches. If you actually want to use k-mer hashes, you would use one of the other views provided by SeqAn3; those are a lot faster.

It is still possible to implement a sliding view in the future.

@SGSSGene SGSSGene requested a review from feldroop April 26, 2022 13:13
@SGSSGene SGSSGene changed the title from "[TEST] remove naive_kmer_hash" to "inline naive_kmer_hash and naive_miniser_hash" on Apr 26, 2022
@eseiler (Member) commented Apr 26, 2022

Just adding to what @SGSSGene said:

I am currently mainly working for the FONDA project, but I have time so I am happy to review some PRs.

If you are ever busy with FONDA, you can just remove your request and reassign the team.

So you have to live with some possibly uninformed questions. ;)

Don't worry, there are no stupid questions and your questions are perfectly valid :)

The way I understand it is this: We want to get rid of range-v3 -> we want to get rid of ranges::views::sliding -> the only place where it is used is in the naive_kmer_hash_fn -> the naive_kmer_hash_fn is only used in a single benchmark, so we reimplement its use there and delete it.

👍

My questions:

  • Is this really the only place where ranges::views::sliding is used in SeqAn3? I have no idea, but it seems to me like this functionality might be actually used a lot. If that's the case, wouldn't it be better to reimplement the sliding view and leave everything here as is?
  • Are we really sure that we will never need the naive kmer hash function for any other benchmarks? We could also keep the naive_kmer_hash_fn and replace the sliding view there with our implementation.

It's the only place we use the range-v3 views::sliding. As we only use it for these two benchmarks, it would be quite a bit of overhead to implement it as a view, which we don't really have a use for.
We would implement it if we had some proper use for it. Anyone having SeqAn3 as a dependency and using the sliding view could still add range-v3 as a dependency to their own project.

@eseiler eseiler merged commit 465f129 into seqan:master May 3, 2022