Enhancement: Various speedups #63

Open
simonmandlik opened this issue Mar 15, 2021 · 3 comments
Labels
enhancement New feature or request

Comments


simonmandlik commented Mar 15, 2021

Noting down some areas where significant speedups may be achieved:

  • vcat in ProductNodes leads to a lot of copying
  • data deduplication in leaves may lower memory requirements and save some compute (computing n-grams only once for identical strings in NGramMatrix multiplication)
  • deduplicating instances in BagNodes in a similar fashion
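The deduplication idea in the last two items can be sketched in plain Julia: run the expensive computation once per distinct input, then scatter the results back to the original positions. This is only an illustration of the principle. `dedup_map` and `slow_length` are hypothetical names and are not part of Mill.jl.

```julia
# Hypothetical helper, not part of Mill.jl: apply `f` once per distinct
# input and copy the results back to every original position.
function dedup_map(f, xs::AbstractVector)
    index = Dict{eltype(xs), Int}()        # value => position in `uniq`
    uniq = eltype(xs)[]                    # distinct values, in order of first appearance
    refs = Vector{Int}(undef, length(xs))  # for each input, the position of its value in `uniq`
    for (i, x) in enumerate(xs)
        refs[i] = get!(index, x) do
            push!(uniq, x)
            length(uniq)
        end
    end
    ys = map(f, uniq)                      # the expensive part runs once per distinct value
    return ys[refs], length(uniq)
end

calls = Ref(0)
slow_length(s) = (calls[] += 1; length(s)) # stand-in for an expensive computation

xs = ["foo", "quux", "foo", "foo", "quux"]
ys, nuniq = dedup_map(slow_length, xs)
```

Here `slow_length` is evaluated twice rather than five times. Whether this pays off depends on how many duplicates the data contains and on how costly the scatter step is compared with the computation it saves.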
@simonmandlik simonmandlik added the enhancement New feature or request label Mar 15, 2021
simonmandlik commented

PoC for deduplication in NGramMatrices

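julia> using Mill, PooledArrays, DataStructures, Random, BenchmarkTools

julia> function Mill._mul(A::AbstractMatrix, S::PooledVector, n, b, m)
           C = zeros(eltype(A), size(A, 1), length(S))
           iz = Mill._init_z(n, b)
           # for each pool reference, collect the columns of C it occupies
           idcs = Dict(r => Queue{Int}() for r in values(S.invpool))
           for (i, r) in enumerate(S.refs)
               enqueue!(idcs[r], i)
           end
           # walk the n-grams of every distinct string only once
           for (s, r) in S.invpool
               z = iz
               for l in 1:Mill._len(s, n)
                   z = Mill._next_ngram(z, l, codeunits(s), n, b)
                   zm = z % m + 1
                   # add column zm of A to every column of C that holds string s
                   for k in idcs[r]
                       for i in 1:size(C, 1)
                           @inbounds C[i, k] += A[i, zm]
                       end
                   end
               end
           end
           C
       end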

julia> ss = [randstring(50), randstring(50)];
julia> S = [rand(ss) for _ in 1:100];
julia> n1 = NGramMatrix(S);
julia> n2 = NGramMatrix(PooledArray(S));
julia> x = randn(100, 2053);
julia> x*n1; @btime x*n1;
  181.145 μs (2 allocations: 78.20 KiB)

julia> x*n2; @btime x*n2;
  108.265 μs (14 allocations: 95.23 KiB)


pevnak commented Jul 15, 2021

I guess that dedup in NGramMatrices will offer the highest benefit. Multiplication by a OneHot matrix is essentially copying, and Dense matrices should not contain that many duplicates (although they might).
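To illustrate why one-hot multiplication amounts to copying, here is a small sketch in plain Julia with a dense one-hot matrix built by hand (it does not use Mill's or Flux's one-hot types; `A`, `idx` and `X` are made-up example values). The product simply selects columns of the weight matrix, so there is little arithmetic for deduplication to save.

```julia
A = [1.0 2.0 3.0;
     4.0 5.0 6.0]          # 2×3 weight matrix
idx = [3, 1, 3]            # hot index of each of three observations

# Dense one-hot matrix: column j has a single 1 in row idx[j].
X = zeros(3, length(idx))
for (j, i) in enumerate(idx)
    X[i, j] = 1.0
end

same = A * X == A[:, idx]  # the product only selects (copies) columns of A
```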

simonmandlik commented

Sure, and deduplication of instances in BagNodes as well. That said, in some cases the vanilla version may still be faster.
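One possible way to handle the cases where the vanilla version wins would be to deduplicate only when the data contains enough repeats. The sketch below is purely an assumption for illustration: `unique_ratio`, `should_dedup` and the `threshold` value are hypothetical and not part of Mill.jl, and a sensible threshold would have to be found by benchmarking.

```julia
# Hypothetical heuristic, not part of Mill.jl.
# Fraction of distinct values among all values (1.0 means no duplicates).
unique_ratio(xs) = length(Set(xs)) / length(xs)

# Deduplicate only if at most `threshold` of the values are distinct.
should_dedup(xs; threshold = 0.5) = !isempty(xs) && unique_ratio(xs) <= threshold

many_repeats = should_dedup(["a", "a", "a", "b"])   # 2 distinct out of 4
no_repeats   = should_dedup(["a", "b", "c"])        # 3 distinct out of 3
```

Note that computing the ratio itself takes a pass over the data, so such a check only makes sense if it is much cheaper than the computation it guards.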

@simonmandlik simonmandlik changed the title Possible speedups Enhancement: Various speedups Jun 7, 2024