propagate() is 20x slower than built-in sparse matmul #106

learning-chip · 2022-01-19T02:35:21Z

With the well-known graph-matrix duality (see GraphBLAS intro, Fig. 1), simple graph message passing kernels are equivalent to sparse matrix-vector multiplication (SpMV) or sparse matrix-matrix multiplication (SpMM). However, I notice that propagate() is more than 20x slower than the built-in A * B for an equivalent operation. I did the same test with DGL, and did not observe such drastic slow down.

To reproduce

using SparseArrays
using GraphNeuralNetworks
using BenchmarkTools
import Random: seed!

n = 1024
seed!(0)
A = sprand(n, n, 0.01)
b = rand(1, n)
B = rand(100, n)

g = GNNGraph(
    A,
    ndata=(; b=b, B=B),
    edata=(; A=reshape(A.nzval, 1, :)),
    graph_type=:coo  # changing to :sparse has little effect on performance
)

function spmv(g)
    propagate(
        (xi, xj, e) -> e .* xj ,  # same as e_mul_xj
        g, +; xj=g.ndata.b, e=g.edata.A
        )
end

function spmm(g)
    propagate(
        (xi, xj, e) -> e .* xj ,  # same as e_mul_xj
        g, +; xj=g.ndata.B, e=g.edata.A
        )
end

isequal(spmv(g),  b * A)  # true
@btime spmv(g)  # ~5 ms
@btime b * A  # ~32 us

isequal(spmm(g), B * A)  # true
@btime spmm(g)  # ~9 ms
@btime B * A  # ~400 us

Such performance gap can't be explained by storing the sparse matrix in COO (GNN libraries' default) vs CSR (SciPy default) vs CSC (Julia default). In the code below, changing scipy matrix format has minor effect on speed. Also, the speed of DGL and SciPy are similar.

Compare with DGL and SciPy

import numpy as np
import scipy.sparse as sp
import torch

import dgl
import dgl.function as fn

n = 1024

np.random.seed(0)
A = sp.random(n, n, density=0.01, format='csc')  # changing format to `coo` or `csr` affects performance, but not much
b = np.random.rand(1, n)
B = np.random.rand(100, n)

g = dgl.from_scipy(A)
g.edata['A'] = torch.tensor(A.data[:, np.newaxis])
g.ndata['b'] = torch.tensor(b.T)
g.ndata['B'] = torch.tensor(B.T)

def spmv(g):
    with g.local_scope():
        g.update_all(fn.e_mul_u('A', 'b', 'm'), fn.sum('m', 'bA'))
        return g.ndata['bA']
    
def spmm(g):
    with g.local_scope():
        g.update_all(fn.e_mul_u('A', 'B', 'M'), fn.sum('M', 'BA'))
        return g.ndata['BA']

np.array_equal(spmv(g).numpy().T, b @ A)  # True
%timeit spmv(g)  # ~200 us
%timeit b @ A  # ~70 us

np.array_equal(spmm(g).numpy().T, B @ A)  # True
%timeit spmm(g)  # ~900 us
%timeit B @ A  # ~900 us

Effect of fusion

DGL's update_all fuses the message and reduction kernels. To mimic the two-stage propagate, and see if fusion causes such performance difference:

def spmv_twostage(g):
    with g.local_scope():
        g.apply_edges(fn.e_mul_u('A', 'b', 'm'))
        g.update_all(
            fn.copy_e('m', 'm'),
            fn.sum('m', 'bA')
        )
        return g.ndata['bA']

%timeit spmv_twostage(g)  # ~240 us; just 20% slower

The unfused version is only slightly slower. It cannot explain the 5ms vs 200us performance gap.

There must be other causes of inefficiency. I'd like to figure it out and bring the performance at least close to DGL. (I use DGL a lot, but there are certain projects that favor all-Julia implementation, and your package seems a good option with a clean syntax🙂 )

Package version

Jl:

GraphNeuralNetworks.jl 0.3.8
Julia 1.7

Py:

DGL 0.7.1
PyTorch 1.9.1

The text was updated successfully, but these errors were encountered:

learning-chip · 2022-01-19T07:53:13Z

Also compared with PyG 2.0.1, which shows good performance with a single node feature b, but much slower performance with multiple edge features B.

An interesting observation here is that, the PyG SpMM scales almost linearly with the size (# of rows) of B. Multiplying a size (100, n) matrix is just 100x slower than multiplying a size (1, n) vector. Feels like spmv is called 100 times, rather than a single spmm kernel with better data locality.

(Recall that DGL performs well on both spmv and spmm cases, and is still the best reference)

Reproducer:

import numpy as np
import scipy.sparse as sp
import torch

from torch_geometric.data import Data
from torch_geometric.nn import MessagePassing

n = 1024

np.random.seed(0)
A = sp.random(n, n, density=0.01, format='coo')
b = np.random.rand(1, n)
B = np.random.rand(100, n)

edge_index = torch.tensor([A.row, A.col], dtype=torch.long)
data = Data(edge_index=edge_index, A=torch.tensor(A.data),
            B=torch.tensor(B), b=torch.tensor(b))

class SpMVLayer(MessagePassing):
    def __init__(self):
        super().__init__(aggr="add", node_dim=-1)
        
    def forward(self, x, edge_index, A):
        return self.propagate(edge_index, x=x, A=A)

    def message(self, x_j, A):
        m = x_j * A
        return m

spmv = SpMVLayer()

BA_gnn = spmv(data.B, data.edge_index, data.A)
np.array_equal(BA_gnn.numpy(), B @ A)  # True
%timeit spmv(data.B, data.edge_index, data.A)  # ~10ms, 10x slower than SciPy

bA_gnn = spmv(data.b, data.edge_index, data.A)
np.array_equal(bA_gnn.numpy(), b @ A)  # True
%timeit spmv(data.b, data.edge_index, data.A)  # ~100us, as fast as SciPy

CarloLucibello · 2022-01-19T08:13:11Z

Thanks, this benchmark is very useful. For this specific operation of the e_mul_xj type, we can implement a specialized fusing propagate, similarly to what we currently have for copy_xj. In fact for copy_xj the performance is on-par with matrix multiplication:

function spmm_copyxj_fused(g)
    propagate(
        copy_xj,
        g, +; xj=g.ndata.B
        )
end

function spmm_copyxj_unfused(g)
    propagate(
        (xi, xj, e) -> xj,
        g, +; xj=g.ndata.B
        )
end

Adj = map(x -> x > 0 ? 1 : 0, A)
@assert spmm_copyxj_unfused(g) ≈ B * Adj
@assert spmm_copyxj_fused(g) ≈ B * A # bug for weighted graphs fixed in #107

@btime spmm_copyxj_fused(g)  # 268.614 μs (22 allocations: 1.13 MiB)
@btime spmm_copyxj_unfused(g)  # 4.263 ms (52855 allocations: 12.23 MiB)
@btime B * Adj  # 196.135 μs (2 allocations: 800.05 KiB)

We can get more generic fusing integrating
https://github.com/JuliaSparse/SuiteSparseGraphBLAS.jl
as discussed a bit with @Wimmerer.

As to why the unfused operation based on gather/scatter are so slow and if can be improved, this has to be analyzed

learning-chip · 2022-01-19T09:19:34Z

We can get more generic fusing integrating
https://github.com/JuliaSparse/SuiteSparseGraphBLAS.jl
as discussed a bit with @Wimmerer.

Thanks, this seems a very promising direction -- is there a dedicated issue to plan and track it? #19 ?

CarloLucibello · 2022-01-19T09:29:54Z

yes we could open a separate issue. I'm not familiar yet with SuiteSparseGraphBLAS so I didn't think about a proper integration plan. The first step would just be a drop-in sparse matrix replacement as in #19 I guess

rayegun · 2022-01-19T21:33:30Z

I'm beginning to work on this area again. The primary issue is still missing rrules for the min/max reduction cases. But I'll make a PR without those by the end of the week.

Those missing rrules are possible, and I'm tackling that issue from 3-4 directions.

CarloLucibello · 2022-01-19T22:15:00Z

We don't necessarily have to takle all together. We can dispatch just on + at the beginning and use gather/scatter for the other reductions

CarloLucibello · 2022-01-20T01:27:54Z

With #107 the example in the OP now gives good results

function spmm_unfused(g)
    propagate(
        (xi, xj, e) -> e .* xj ,  # same as e_mul_xj
        g, +; xj=g.ndata.B, e=g.edata.A
        )
end

function spmm_fused(g)
    propagate(
        e_mul_xj,
        g, +; xj=g.ndata.B, e=vec(g.edata.A)
        )
end

@assert isequal(spmm_fused(g), B * A)  # true
@assert isequal(spmm_unfused(g), B * A)  # true
@btime spmm_fused(g)  #   271.228 μs (33 allocations: 1.21 MiB)
@btime spmm_unfused(g)  # 4.324 ms (52855 allocations: 12.23 MiB)
@btime B * A  # 196.115 μs (2 allocations: 800.05 KiB)

learning-chip · 2022-01-21T04:01:42Z

Confirmed the speed-up from fusion with #107. Thanks for such quick response.

spmv kernel now takes 200 us (vs 5ms previously, vs 32 us for builtin b * A)
spmm kernel now takes 600 us (vs 9ms previously, vs 400 us for builtin B * A)

Also notice that, the speed-up is gone with passing e=g.edata.A (size (1, num_edges) matrix) instead of e=vec(g.edata.A) (1D vector). This is understandable, because multiple edge features represented by a (num_features, num_edges) (dense) matrix correspond to a 3D sparse tensor of shape (num_features, row, col), which is beyond the common, 2D sparse linear algebra.

Nonetheless, the case of multiple edge features could have a large room for performance optimization. A simple reasoning is: we know that the special case num_features=1 (without flattening to 1D vector) is not optimized (unfused), resulting in 20x performance gap from optimal. The general num_features > 1 case should inherit the same source of inefficiency, compared to specialized sparse tensor library like TACO, or just compared to calling the built-in B * A spmm kernel num_features times.

3D sparse tensor is outside of standard GraphBLAS, so I doubt if SuiteSparseGraphBLAS.jl works for such case?

CarloLucibello · 2022-01-21T06:53:07Z

For multiple edge features, one could use NNlib.batched_mul or Tullio.jl, but the main problem is that julia doesn't have well supported sparse tensor types AFAIK

rayegun · 2022-01-21T19:35:59Z

3D sparse tensor is outside of standard GraphBLAS, so I double if SuiteSparseGraphBLAS.jl works for such case?

No it doesn't. We would flatten when using GraphBLAS for that case.

specialized sparse tensor library like TACO

That's what TACO generating Julia is for :), which is pretty close!

For multiple edge features, one could use NNlib.batched_mul or Tullio.jl, but the main problem is that julia doesn't have well supported sparse tensor types AFAIK

Yes I think this will work fine for now. I'll have a TACO.jl and associated packages to provide those eventually, hopefully in the next few months.

We can take care of speeding up num_features > 1 as that comes around.

My plan right now: SuiteSparse:GraphBLAS where it's possible, TACO everywhere else. SS:GrB will be faster in most cases (often significantly), but TACO will provide backup.

rayegun · 2022-01-25T11:43:16Z

Also one other thing I forgot to mention. In the medium-term (probably 1-a few years) SuiteSparse:GraphBLAS will support fast user defined datatypes. This makes it pretty easy to add support for multidimensional edge features by just having a tuple/vector type.

I will be exposing the current functionality to do so shortly in SS:GrB, although user defined functions/datatypes are currently quite slow since they use pointers for everything.

So we will hopefully have two backends with different tradeoffs: TACO.jl and SS:GrB, both of which should support the full set of computations required with different speed tradeoffs depending on the computation.

learning-chip · 2022-02-12T10:25:34Z

So we will hopefully have two backends with different tradeoffs: TACO.jl and SS:GrB

Just notice a promising new work Compiler Support for Sparse Tensor Computations in MLIR. Table 1-4 contain comparison to TACO. Worth considering.

rayegun · 2022-02-12T17:26:02Z

Ooh I'm glad that finally came out, Aart has been working on that dialect for a while.

CarloLucibello mentioned this issue Jan 19, 2022

add fusing propagate specialization for e_mul_xj #108

Closed

CarloLucibello mentioned this issue Jan 20, 2022

fused e_mul_xj and weighted option for adjacency_matrix #107

Merged

CarloLucibello closed this as completed in #107 Jan 20, 2022

learning-chip mentioned this issue Feb 7, 2022

aggregate_neighbors() is 100x slower than equivalent sparse matrix operation #124

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

propagate() is 20x slower than built-in sparse matmul #106

propagate() is 20x slower than built-in sparse matmul #106

learning-chip commented Jan 19, 2022 •

edited

Loading

learning-chip commented Jan 19, 2022 •

edited

Loading

CarloLucibello commented Jan 19, 2022 •

edited

Loading

learning-chip commented Jan 19, 2022

CarloLucibello commented Jan 19, 2022

rayegun commented Jan 19, 2022

CarloLucibello commented Jan 19, 2022

CarloLucibello commented Jan 20, 2022 •

edited

Loading

learning-chip commented Jan 21, 2022 •

edited

Loading

CarloLucibello commented Jan 21, 2022

rayegun commented Jan 21, 2022 •

edited

Loading

rayegun commented Jan 25, 2022 •

edited

Loading

learning-chip commented Feb 12, 2022

rayegun commented Feb 12, 2022

propagate() is 20x slower than built-in sparse matmul #106

propagate() is 20x slower than built-in sparse matmul #106

Comments

learning-chip commented Jan 19, 2022 • edited Loading

To reproduce

Compare with DGL and SciPy

Effect of fusion

Package version

learning-chip commented Jan 19, 2022 • edited Loading

CarloLucibello commented Jan 19, 2022 • edited Loading

learning-chip commented Jan 19, 2022

CarloLucibello commented Jan 19, 2022

rayegun commented Jan 19, 2022

CarloLucibello commented Jan 19, 2022

CarloLucibello commented Jan 20, 2022 • edited Loading

learning-chip commented Jan 21, 2022 • edited Loading

CarloLucibello commented Jan 21, 2022

rayegun commented Jan 21, 2022 • edited Loading

rayegun commented Jan 25, 2022 • edited Loading

learning-chip commented Feb 12, 2022

rayegun commented Feb 12, 2022

learning-chip commented Jan 19, 2022 •

edited

Loading

learning-chip commented Jan 19, 2022 •

edited

Loading

CarloLucibello commented Jan 19, 2022 •

edited

Loading

CarloLucibello commented Jan 20, 2022 •

edited

Loading

learning-chip commented Jan 21, 2022 •

edited

Loading

rayegun commented Jan 21, 2022 •

edited

Loading

rayegun commented Jan 25, 2022 •

edited

Loading