
aggregate_neighbors() is 100x slower than equivalent sparse matrix operation #124

Closed
learning-chip opened this issue Feb 7, 2022 · 5 comments


learning-chip commented Feb 7, 2022

Although #106 has been solved by the fusion in #108, the slowness of the unfused implementation (apply_edges + aggregate_neighbors) was never clearly understood. Realistic GNN models contain mixed calls to message functions, reduce functions, and neural network layers, so they don't always have the nice form that #108 requires.

Profiling with ProfileSVG.jl shows that 60% of the time is spent in aggregate_neighbors:

[Flame graph: prof_spmm]

Neighbor reduction with + is equivalent to either:

  • the vector-matrix product e * A, where e is a row vector of ones, or
  • the column sums of the sparse matrix A.

Either way turns out to be more than 100x faster than aggregate_neighbors.

Reproducible example

using SparseArrays
using GraphNeuralNetworks
import Random: seed!
using BenchmarkTools

n = 1024
seed!(0)
A = sprand(n, n, 0.01)

g = GNNGraph(
    A,
    edata=(; A=reshape(A.nzval, 1, :)),
    graph_type=:coo
)

out = aggregate_neighbors(g, +, g.edata.A)
@btime aggregate_neighbors(g, +, g.edata.A)  # ~4 ms

e = ones(1, n)
e * A == out  # true
@btime e * A  # ~20 us

sum(A, dims=1) ≈ out  # true
@btime sum(A, dims=1)  # ~10 us

This example uses only a single edge feature. Multiple edge features would correspond to a 3D sparse tensor, which SparseArrays.jl does not support -- TACO could be used in that case.
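That said, multiple edge features don't strictly require a 3D sparse tensor: each feature channel shares A's sparsity structure, so a d × num_edges feature array can be aggregated as d column sums over per-channel sparse matrices. A sketch of this idea (the channel count d and the feature array E below are made up for illustration):

```julia
using SparseArrays
import Random: seed!

seed!(0)
n = 1024
A = sprand(n, n, 0.01)
I, J, _ = findnz(A)  # row/column indices, in the same order as A.nzval

d = 4                # hypothetical number of edge feature channels
E = rand(d, nnz(A))  # d features per edge

# One sparse matrix per channel, all sharing A's structure; aggregate each
# channel by its column sums and stack the results into a d × n array.
out = vcat((sum(sparse(I, J, E[k, :], n, n), dims=1) for k in 1:d)...)
size(out)  # (4, 1024)
```

This stays within 2D sparse matrices at the cost of d separate passes, so it is a workaround rather than a substitute for true higher-dimensional sparse support.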

Package version

CarloLucibello (Member) commented:

Nice! The comparison is not totally fair, because g.edata could contain an arbitrary 1 x num_edges feature array e, so we would first have to construct a matrix A out of e and then perform sum(A, dims=1). But even with this correction, I guess it is going to be much, much faster than what aggregate_neighbors currently does.

aggregate_neighbors is just NNlib.scatter, so whatever fix we come up with should be implemented there.
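For reference, a minimal example of what that scatter-add looks like in isolation (a sketch; the edge features e and destination indices dst below are made up, not taken from the issue's graph):

```julia
using NNlib

# Three edges with 2 features each; edges 1 and 2 share destination node 1.
e = [1.0  2.0  3.0;
     10.0 20.0 30.0]
dst = [1, 1, 2]

# scatter(+, e, dst) sums the feature columns of edges sharing a destination:
# column 1 = e[:, 1] + e[:, 2], column 2 = e[:, 3].
NNlib.scatter(+, e, dst)
```

With op = + this is exactly the column-sum pattern from the original post, which is why a sparse-matrix formulation can serve as a fast path.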

I think we should introduce a fast path in scatter for the case op = + and size(e, 1) == 1, doing sum(A, dims=1). It's a bit annoying to have many special cases, but it is worth doing for relevant use cases until we have better generic solutions.
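One way such a fast path could look (a sketch, not the actual NNlib implementation; it assumes the destination node index of each edge is available as a vector dst, and relies on sparsevec accumulating duplicate indices with +, which is the same column-sum trick):

```julia
using SparseArrays

# Hypothetical fast path for scatter with op = + and a 1 × num_edges input:
# sparsevec combines repeated indices with +, so building a sparse vector
# indexed by destination node performs the scatter-add in one pass.
function scatter_add_row(e::AbstractMatrix, dst::AbstractVector{<:Integer}, n::Integer)
    @assert size(e, 1) == 1
    reshape(Vector(sparsevec(dst, vec(e), n)), 1, n)
end

e = [1.0 2.0 3.0]
dst = [1, 1, 2]
scatter_add_row(e, dst, 2)  # → [3.0 3.0]
```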

AriMKatz commented Feb 7, 2022

> until we have better generic solutions.

What would those look like and at what layer in the stack? Should we open up some issues in GPUArrays or CUDA.jl? Or something in the #arrayIR slack channel?

CarloLucibello (Member) commented Feb 7, 2022

> What would those look like and at what layer in the stack? Should we open up some issues in GPUArrays or CUDA.jl? Or something in the #arrayIR slack channel?

The first step would be to integrate https://github.com/JuliaSparse/SuiteSparseGraphBLAS.jl here for fast, parallelized sparse matrix operations. Then there is a lot of work to be done on sparse matrix support in CUDA.jl; there are a few issues open over there. Finally, maybe develop sparse arrays in arbitrary dimensions and the corresponding operations (see TACO).

CarloLucibello (Member) commented Feb 7, 2022

Solving this issue was actually quite easy:

@btime aggregate_neighbors(g, +, g.edata.A)
# 3.773 ms (52848 allocations: 3.39 MiB)  # on NNlib master
# 59.660 μs (3 allocations: 8.23 KiB)     # with https://github.com/FluxML/NNlib.jl/pull/384

CarloLucibello (Member) commented:

Closing this as the NNlib PR is being merged.
