Enhancement: Various speedups #63
Labels: enhancement
Comments
PoC for deduplication in NGramMatrices:

julia> using Mill, PooledArrays, DataStructures, Random, BenchmarkTools

julia> function Mill._mul(A::AbstractMatrix, S::PooledVector, n, b, m)
           C = zeros(eltype(A), size(A, 1), length(S))
           iz = Mill._init_z(n, b)
           # for each pool reference, collect the output columns where it occurs
           idcs = Dict(r => Queue{Int}() for r in values(S.invpool))
           for (i, r) in enumerate(S.refs)
               enqueue!(idcs[r], i)
           end
           # walk the n-grams of each distinct string only once
           for (s, r) in S.invpool
               z = iz
               for l in 1:Mill._len(s, n)
                   z = Mill._next_ngram(z, l, codeunits(s), n, b)
                   zm = z % m + 1
                   # add the matching column of A to every output column holding s
                   for k in idcs[r]
                       for i in 1:size(C, 1)
                           @inbounds C[i, k] += A[i, zm]
                       end
                   end
               end
           end
           C
       end

julia> ss = [randstring(50), randstring(50)];

julia> S = [rand(ss) for _ in 1:100];

julia> n1 = NGramMatrix(S);

julia> n2 = NGramMatrix(PooledArray(S));

julia> x = randn(100, 2053);

julia> x*n1; @btime x*n1;
  181.145 μs (2 allocations: 78.20 KiB)

julia> x*n2; @btime x*n2;
  108.265 μs (14 allocations: 95.23 KiB)
I guess that dedup in NGramMatrices will offer the highest benefit. Multiplication of OneHot is essentially just copying, and Dense matrices should not contain that many duplicates (although they might).
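To make the trade-off concrete, here is a minimal sketch of the deduplication idea in plain Julia, without any Mill internals. The names `dedup_columns` and `toy_feature` are hypothetical and exist only for this illustration; `toy_feature` is a stand-in for the per-string n-gram hashing, not Mill's actual computation. The sketch assumes the per-item result has a known length `d`.

```julia
# Hypothetical helper (not part of Mill): compute f(s) once per distinct s
# and copy the result into every column where s occurs again.
function dedup_columns(f, S::AbstractVector, d::Integer)
    C = zeros(Float64, d, length(S))
    first_col = Dict{eltype(S), Int}()   # distinct value => column holding its result
    for (k, s) in enumerate(S)
        j = get(first_col, s, 0)
        if j == 0
            C[:, k] = f(s)               # first occurrence: do the expensive work
            first_col[s] = k
        else
            @views C[:, k] .= C[:, j]    # repeat: plain column copy
        end
    end
    return C
end

# Toy stand-in for an expensive per-string feature: byte counts hashed into d buckets.
function toy_feature(s::AbstractString, d::Integer)
    v = zeros(Float64, d)
    for c in codeunits(s)
        v[c % d + 1] += 1
    end
    return v
end

S = ["ab", "cd", "ab", "ab", "cd"]
C = dedup_columns(s -> toy_feature(s, 8), S, 8)
```

The gain depends on how expensive `f` is relative to a column copy, which is consistent with the observation above: when the per-item work is itself just a copy, deduplication saves little.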
Sure, and deduplication of instances in
Noting down some areas where significant speedups may be achieved:
- vcat in ProductNodes leads to a lot of copying
- … NGramMatrix (… multiplication)
- … BagNodes in a similar fashion