Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update tcm methods to new VK version #25

Merged
merged 4 commits into from
Mar 26, 2024
Merged

Update tcm methods to new VK version #25

merged 4 commits into from
Mar 26, 2024

Conversation

camilogarciabotero
Copy link
Member

@camilogarciabotero camilogarciabotero commented Mar 18, 2024

This PR updates the transition_count_matrix to be compatible with new version of VectorizedKmers.jl (v0.9).

@camilogarciabotero camilogarciabotero self-assigned this Mar 19, 2024
Copy link
Member Author

@camilogarciabotero camilogarciabotero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of recording some unusual benches:

using BioSequences, BioMarkovChains

seq = randdnaseq(10^6)

@btime transition_count_matrix(seq); #main
1.101 ms (2 allocations: 384 bytes)

@btime transition_count_matrix(seq); #vkpatch
1.272 ms (31 allocations: 1.58 KiB)

I honestly don't know why is that happening since the underlying function count_kmers is improved in the new VectorizedKmers version. I do wonder if @anton083 do you know what might be going on here?

camilogarciabotero

This comment was marked as duplicate.

@AntonOresten
Copy link

AntonOresten commented Mar 24, 2024

Hi Camilo!

Good catch. I found the issue to be the extra overhead of calling:

kmer_array[kmer] += 1

instead of:

kmer_array.values[kmer + 1] += 1

in my count_kmers! methods.

The former method is defined like so:

@inline Base.setindex!(ka::KmerArray, v, kmer::Integer) = (ka.values[kmer + 1] = v)

with @inline and everything, so I didn't think there'd be much of a difference.

I'll be switching to the latter in the count_kmers! methods in a new patch.

Thank you for spotting this!😊
Anton

AntonOresten referenced this pull request in AntonOresten/VectorizedKmers.jl Mar 24, 2024
@AntonOresten
Copy link

AntonOresten commented Mar 24, 2024

The changes have now been pushed to patch v0.9.1.

using VectorizedKmers, BioSequences, BenchmarkTools

@btime count_kmers($(randdnaseq(10^6)), 2) # v0.8.1
  990.500 μs (28 allocations: 1.38 KiB)

@btime count_kmers($(randdnaseq(10^6)), 2) # v0.9.0
  1.054 ms (29 allocations: 1.38 KiB)

@btime count_kmers($(randdnaseq(10^6)), 2); # v0.9.1
  963.600 μs (29 allocations: 1.38 KiB)

v0.9.1 is only performing better than v0.8.1 because I'm now indexing a pre-defined kmer_array_values = kmer_array.values instead of indexing kmer_array.values every time. The tiniest changes can have significant impact at this level.😆

@camilogarciabotero
Copy link
Member Author

Thank you Anton,

The runtime performance looks nice now. I was also confused/concerned by the number and memory allocations I noticed in the benchmark. I don't know if you can reproduce it with BioMarkovChains.

@AntonOresten
Copy link

AntonOresten commented Mar 24, 2024

I'm afraid the memory allocations come from neither the KmerArray constructor, nor count_kmers! function itself, but instead the dispatching of count_kmers! within the count_kmers function. I'm not sure, but this might be due to some type-instability.

@btime begin
    kmer_array = KmerArray(4, 2)
    count_kmers!(kmer_array, seq)
end;
  958.300 μs (1 allocation: 192 bytes)

function _count_kmers(seq, K::Integer)
    kmer_array = KmerArray(4, K)
    count_kmers!(kmer_array, seq)
end;

@btime _count_kmers(seq, 2);
  961.500 μs (29 allocations: 1.38 KiB)

As soon as we put it in a function, it does some extra allocations for some reason.

And we can narrow this down to the count_kmers! call because without it we don't see the allocations:

function _count_kmers_no_counting(seq, K::Integer)
    kmer_array = KmerArray(4, K)
    #count_kmers!(kmer_array, seq)
end;

@btime _count_kmers_no_counting(4, 2);
  35.944 ns (1 allocation: 192 bytes)

But as you can see in the first snippet, the count_kmers! call itself doesn't seem to make any allocations.

Maybe relevant: https://discourse.julialang.org/t/unexpected-allocations-with-multiple-dispatch/76988

EDIT: Now in hindsight it's clear that the types in the _count_kmers function weren't stable cause K was a function argument, whereas N and K were fixed in the first block.

@AntonOresten
Copy link

AntonOresten commented Mar 24, 2024

I have discovered the problem. When alphabet size N and K-mer length K are passed as regular arguments to count_kmers their values are unknown at compile time. To fix this, the base method for count_kmers should take integers wrapped in Vals, such that it can compile for each N and K pair individually.

count_kmers1(seq, N::Int, K::Int) = count_kmers!(KmerArray(N, K), seq);

@btime count_kmers1(seq, 4, 2);
  964.000 μs (29 allocations: 1.38 KiB)

count_kmers2(seq, ::Val{N}, ::Val{K}) where {N, K} = count_kmers!(KmerArray(N, K), seq);

@btime count_kmers2(seq, Val(4), Val(2));
  958.500 μs (2 allocations: 208 bytes)

For the API to remain the same, all we need is another method that wraps integers N and K in Vals.

count_kmers2(seq, N::Int, K::Int) = count_kmers2(seq, Val(N), Val(K))

I will still make sure that there is a method that uses the default alphabet size, such that users don't need to specify the value of N.

I think I've seen the bio people use this in Kmers.jl, but I never really understood the benefit. Now I do!

AntonOresten referenced this pull request in AntonOresten/VectorizedKmers.jl Mar 24, 2024
@camilogarciabotero
Copy link
Member Author

Wow! That was a tight demonstration of problem-solving! Thanks for sharing the rationale to solve it Anton. I will now update to latest VK version!

@camilogarciabotero camilogarciabotero merged commit c2234c3 into main Mar 26, 2024
4 checks passed
@camilogarciabotero camilogarciabotero deleted the vkpatch branch March 26, 2024 00:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants