
aggregate_neighbors() is 100x slower than equivalent sparse matrix operation #124

Closed
learning-chip opened this issue Feb 7, 2022 · 5 comments


learning-chip commented Feb 7, 2022

Although #106 has been solved by the fusion in #108, the slowness of the unfused implementation (apply_edges + aggregate_neighbors) was never clearly understood. Realistic GNN models contain mixed calls to message functions, reduce functions, and neural network layers, so they don't always have the nice form that #108 requires.

Profiling with ProfileSVG.jl shows that 60% of the time is spent in aggregate_neighbors:

[Flame graph: prof_spmm]

Neighbor reduction with + is equivalent to either:

  • the vector-matrix product e * A, where e is a row vector of ones, or
  • the column sums of the sparse matrix A.

Either way turns out to be more than 100x faster than aggregate_neighbors.

Reproducible example

using SparseArrays
using GraphNeuralNetworks
import Random: seed!
using BenchmarkTools

n = 1024
seed!(0)
A = sprand(n, n, 0.01)

g = GNNGraph(
    A,
    edata=(; A=reshape(A.nzval, 1, :)),
    graph_type=:coo
)

out = aggregate_neighbors(g, +, g.edata.A)
@btime aggregate_neighbors(g, +, g.edata.A)  # ~4 ms

e = ones(1, n)
e * A == out  # true
@btime e * A  # ~20 us

sum(A, dims=1) ≈ out  # true
@btime sum(A, dims=1)  # ~10 us

This example uses only a single edge feature. Multiple edge features would correspond to a 3D sparse tensor, which SparseArrays.jl does not support -- TACO could be used in that case.
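That said, multiple edge features don't strictly require a 3D sparse tensor: each feature channel shares A's sparsity structure, so a d × num_edges feature array can be aggregated as d column sums over per-channel sparse matrices. A sketch of this idea (the channel count d and the feature array E below are made up for illustration):

```julia
using SparseArrays
import Random: seed!

seed!(0)
n = 1024
A = sprand(n, n, 0.01)
I, J, _ = findnz(A)  # row/column indices, in the same order as A.nzval

d = 4                # hypothetical number of edge feature channels
E = rand(d, nnz(A))  # d features per edge

# One sparse matrix per channel, all sharing A's structure; aggregate each
# channel by its column sums and stack the results into a d × n array.
out = vcat((sum(sparse(I, J, E[k, :], n, n), dims=1) for k in 1:d)...)
size(out)  # (4, 1024)
```

This stays within 2D sparse matrices at the cost of d separate passes, so it is a workaround rather than a substitute for true higher-dimensional sparse support.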

Package version

CarloLucibello (Member) commented:

Nice! The comparison is not totally fair, because g.edata could contain an arbitrary 1 x num_edges feature array e, so we would first have to construct a matrix A out of e and then perform sum(A, dims=1). But even with this correction, I guess it is going to be much, much faster than what aggregate_neighbors currently does.

aggregate_neighbors is just NNlib.scatter, so whatever fix we come up with should be implemented there.
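For reference, a minimal example of what that scatter-add looks like in isolation (a sketch; the edge features e and destination indices dst below are made up, not taken from the issue's graph):

```julia
using NNlib

# Three edges with 2 features each; edges 1 and 2 share destination node 1.
e = [1.0  2.0  3.0;
     10.0 20.0 30.0]
dst = [1, 1, 2]

# scatter(+, e, dst) sums the feature columns of edges sharing a destination:
# column 1 = e[:, 1] + e[:, 2], column 2 = e[:, 3].
NNlib.scatter(+, e, dst)
```

With op = + this is exactly the column-sum pattern from the original post, which is why a sparse-matrix formulation can serve as a fast path.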

I think we should introduce a fast path in scatter for the case op = + and size(e, 1) == 1, doing sum(A, dims=1). It's a bit annoying to have many special cases, but it is worth doing for relevant use cases until we have better generic solutions.
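One way such a fast path could look (a sketch, not the actual NNlib implementation; it assumes the destination node index of each edge is available as a vector dst, and relies on sparsevec accumulating duplicate indices with +, which is the same column-sum trick):

```julia
using SparseArrays

# Hypothetical fast path for scatter with op = + and a 1 × num_edges input:
# sparsevec combines repeated indices with +, so building a sparse vector
# indexed by destination node performs the scatter-add in one pass.
function scatter_add_row(e::AbstractMatrix, dst::AbstractVector{<:Integer}, n::Integer)
    @assert size(e, 1) == 1
    reshape(Vector(sparsevec(dst, vec(e), n)), 1, n)
end

e = [1.0 2.0 3.0]
dst = [1, 1, 2]
scatter_add_row(e, dst, 2)  # → [3.0 3.0]
```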

AriMKatz commented Feb 7, 2022

> until we have better generic solutions.

What would those look like and at what layer in the stack? Should we open up some issues in GPUArrays or CUDA.jl? Or something in the #arrayIR slack channel?

CarloLucibello (Member) commented Feb 7, 2022

> What would those look like and at what layer in the stack? Should we open up some issues in GPUArrays or CUDA.jl? Or something in the #arrayIR slack channel?

The first step would be to integrate https://github.com/JuliaSparse/SuiteSparseGraphBLAS.jl here for fast, parallelized sparse matrix operations. Then there is a lot of work to be done on sparse matrix support in CUDA.jl; there are a few issues open over there. Finally, maybe develop sparse arrays in arbitrary dimensions and the corresponding operations (see TACO).

CarloLucibello (Member) commented Feb 7, 2022

Solving this issue was actually quite easy:

@btime aggregate_neighbors(g, +, g.edata.A)
# 3.773 ms (52848 allocations: 3.39 MiB)  # on NNlib master
# 59.660 μs (3 allocations: 8.23 KiB)     # with https://github.com/FluxML/NNlib.jl/pull/384

CarloLucibello (Member) commented:

Closing this as the NNlib PR is being merged.
