revisit FeaturedGraph implementation #215
Conversation
Now @yuehhua this is ready for review.
(force-pushed from abc31cf to c792c5b)
Bump. If we get this in I can proceed with the other changes (a lot of bug fixes and huge perf improvements).
remove Zygote (force-pushed from 943966d to e272139)
@yuehhua I think this is complete. CUDA sparse support is still not very good. Hopefully it can be cleaned up a bit in future PRs here and in CUDA.jl (e.g. JuliaGPU/CUDA.jl#1093).
Could you please move the […]?
The changes to FeaturedGraph are basically the whole PR! This work has been quite time-consuming already, and I fear that applying these and a few future changes across 3 packages would be a nightmare. Can we temporarily drop the dependence on GraphSignals until things settle down?
I benchmarked on my side. I list the benchmark scores below. For `GCNConv`:

julia> df[df.layer .== "GCNConv", :]
20×8 DataFrame
Row │ N c layer gtype time_fg time_cpu time_gpu gpu_to_cpu
│ Int64? Float64? String? Symbol? Any Any Any Float64?
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 10 6.0 GCNConv adjlist TrialEstimate(401.628 ns) TrialEstimate(33.625 μs) missing Inf
2 │ 10 6.0 GCNConv dense TrialEstimate(1.475 μs) TrialEstimate(34.917 μs) missing Inf
3 │ 10 6.0 GCNConv lightgraph TrialEstimate(371.961 ns) TrialEstimate(32.292 μs) missing Inf
4 │ 10 6.0 GCNConv sparse TrialEstimate(442.078 ns) TrialEstimate(30.709 μs) missing Inf
5 │ 10 6.0 GCNConv sparsegraph TrialEstimate(8.382 μs) TrialEstimate(31.001 μs) missing Inf
6 │ 100 6.0 GCNConv adjlist TrialEstimate(2.533 μs) TrialEstimate(112.584 μs) missing Inf
7 │ 100 6.0 GCNConv dense TrialEstimate(32.333 μs) TrialEstimate(111.064 μs) missing Inf
8 │ 100 6.0 GCNConv lightgraph TrialEstimate(4.181 μs) TrialEstimate(94.834 μs) missing Inf
9 │ 100 6.0 GCNConv sparse TrialEstimate(4.148 μs) TrialEstimate(108.668 μs) missing Inf
10 │ 100 6.0 GCNConv sparsegraph TrialEstimate(80.376 μs) TrialEstimate(105.001 μs) missing Inf
11 │ 1000 6.0 GCNConv adjlist TrialEstimate(26.625 μs) TrialEstimate(760.026 μs) missing Inf
12 │ 1000 6.0 GCNConv dense TrialEstimate(1.197 ms) TrialEstimate(932.298 μs) missing Inf
13 │ 1000 6.0 GCNConv lightgraph TrialEstimate(54.167 μs) TrialEstimate(729.047 μs) missing Inf
14 │ 1000 6.0 GCNConv sparse TrialEstimate(39.834 μs) TrialEstimate(938.006 μs) missing Inf
15 │ 1000 6.0 GCNConv sparsegraph TrialEstimate(825.986 μs) TrialEstimate(887.924 μs) missing Inf
16 │ 10000 6.0 GCNConv adjlist TrialEstimate(146.918 μs) TrialEstimate(8.673 ms) missing Inf
17 │ 10000 6.0 GCNConv dense TrialEstimate(122.352 ms) TrialEstimate(8.651 ms) missing Inf
18 │ 10000 6.0 GCNConv lightgraph TrialEstimate(587.150 μs) TrialEstimate(7.807 ms) missing Inf
19 │ 10000 6.0 GCNConv sparse TrialEstimate(333.669 μs) TrialEstimate(10.331 ms) missing Inf
20 │ 10000      6.0  GCNConv  sparsegraph  TrialEstimate(7.334 ms)    TrialEstimate(7.313 ms)    missing          Inf

For `GraphConv`:

julia> df[df.layer .== "GraphConv", :]
20×8 DataFrame
Row │ N c layer gtype time_fg time_cpu time_gpu gpu_to_cpu
│ Int64? Float64? String? Symbol? Any Any Any Float64?
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 10 6.0 GraphConv adjlist TrialEstimate(395.317 ns) TrialEstimate(255.794 μs) missing Inf
2 │ 10 6.0 GraphConv dense TrialEstimate(1.444 μs) TrialEstimate(263.335 μs) missing Inf
3 │ 10 6.0 GraphConv lightgraph TrialEstimate(369.883 ns) missing missing missing
4 │ 10 6.0 GraphConv sparse TrialEstimate(417.866 ns) TrialEstimate(210.189 μs) missing Inf
5 │ 10 6.0 GraphConv sparsegraph TrialEstimate(8.424 μs) TrialEstimate(320.670 μs) missing Inf
6 │ 100 6.0 GraphConv adjlist TrialEstimate(2.900 μs) TrialEstimate(4.173 ms) missing Inf
7 │ 100 6.0 GraphConv dense TrialEstimate(34.209 μs) TrialEstimate(4.137 ms) missing Inf
8 │ 100 6.0 GraphConv lightgraph TrialEstimate(3.852 μs) missing missing missing
9 │ 100 6.0 GraphConv sparse TrialEstimate(4.111 μs) TrialEstimate(5.943 ms) missing Inf
10 │ 100 6.0 GraphConv sparsegraph TrialEstimate(80.293 μs) TrialEstimate(4.488 ms) missing Inf
11 │ 1000 6.0 GraphConv adjlist TrialEstimate(27.250 μs) TrialEstimate(214.105 ms) missing Inf
12 │ 1000 6.0 GraphConv dense TrialEstimate(1.197 ms) TrialEstimate(199.505 ms) missing Inf
13 │ 1000 6.0 GraphConv lightgraph TrialEstimate(55.500 μs) missing missing missing
14 │ 1000 6.0 GraphConv sparse TrialEstimate(38.000 μs) TrialEstimate(281.798 ms) missing Inf
15 │ 1000 6.0 GraphConv sparsegraph TrialEstimate(821.590 μs) TrialEstimate(209.873 ms) missing Inf
16 │ 10000 6.0 GraphConv adjlist TrialEstimate(151.335 μs) missing missing missing
17 │ 10000 6.0 GraphConv dense TrialEstimate(122.266 ms) missing missing missing
18 │ 10000 6.0 GraphConv lightgraph TrialEstimate(594.670 μs) missing missing missing
19 │ 10000 6.0 GraphConv sparse TrialEstimate(335.211 μs) missing missing missing
20 │ 10000      6.0  GraphConv  sparsegraph  TrialEstimate(7.461 ms)    missing                    missing   missing

Sadly, I didn't find the COO format in the benchmark script, so we can only compare the top scores with the scores here. The COO format has a somewhat lower CPU time than the sparse array for large graphs in […]. Conclusively, the best performance in […]. I decided to adopt the new design.
Thank you for the huge effort. This helps decide the core of the featured graph. I have merged yuehhua/GraphSignals.jl#54.
I'm really uncomfortable with this decision, for several reasons: […]
So any decision stemming from performance or other considerations should be taken once this and the rest of #204 are merged.
I can see that the integration can be made easily, but decoupling the internal representation introduces complexity. I am confused about the pursuit of simplicity and maintainability, since these two goals conflict here. So, is the decoupling really what we want?
I showed that […].
That is a good point. GPU support can also be added in the following PRs.
I don't think I fully understand this comment. What I wanted to say is that previously, when constructing a […]. As this PR shows, supporting 2 or 3 graph types can be done simply: the code here is much more concise, maintainable, and simple than the one in GraphSignals.jl and GraphLaplacians.jl, which this PR replaces entirely. We could also reduce the supported types to a single one in the future, if we manage to have one implementation with good performance on all layers and good CUDA support. The changes in yuehhua/GraphSignals.jl#54 instead, where a FeaturedGraph contains a SparseGraph that contains a SparseMatrix, seem to me to be going in the opposite direction of simplicity and maintainability.
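To illustrate how a multi-backend constructor can stay concise, here is a minimal hedged sketch. The names (`FG`, `to_dense`, the exact `graph_type` values) are illustrative, not necessarily this PR's actual API:

```julia
using SparseArrays

# Hypothetical container: a single struct wrapping one of several graph representations.
struct FG{G}
    graph::G
end

# Build a dense n×n adjacency matrix from source/target edge lists.
function to_dense(s, t, n)
    A = zeros(Int, n, n)
    for (i, j) in zip(s, t)
        A[i, j] = 1
    end
    return A
end

# One constructor keyword selects the internal representation;
# everything downstream can then dispatch on the wrapped type.
function FG(s::Vector{Int}, t::Vector{Int}, n::Int; graph_type::Symbol = :coo)
    graph_type === :coo    && return FG((s, t))
    graph_type === :dense  && return FG(to_dense(s, t, n))
    graph_type === :sparse && return FG(sparse(s, t, ones(Int, length(s)), n, n))
    error("unsupported graph_type: $graph_type")
end

fg = FG([1, 2, 3], [2, 3, 1], 3; graph_type = :sparse)
```

The point of the sketch is that adding or removing a backend touches only the constructor, while layers dispatch on the wrapped graph type.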
I don't understand how you compared the performance of the COO implementation without running the benchmark script on this PR. In any case, it is rather pointless to compare performance right now, because the current message passing is highly inefficient; I can obtain a 10x or greater speedup! We can compare performance in a meaningful way after revisiting the message-passing scheme.
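For context on why the COO layout maps well onto message passing, here is a minimal gather/scatter sketch (illustrative only, not the package's actual code):

```julia
# Edges 1→2, 2→3, 3→1 in COO form.
s = [1, 2, 3]                    # edge sources
t = [2, 3, 1]                    # edge targets
X = [1.0 2.0 3.0]                # 1×3 node-feature matrix (one feature per node)

messages = X[:, s]               # gather: one message per edge, a single vectorized indexing op
out = zeros(size(X))
for (e, j) in enumerate(t)
    out[:, j] .+= messages[:, e] # scatter-add: accumulate messages into target nodes
end
# out == [3.0 1.0 2.0]: each node holds the sum of its in-neighbors' features
```

Both the gather and the scatter are bulk array operations over edge indices, which is what makes a GPU implementation straightforward for this layout.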
Not really, not in the near future. yuehhua/GraphSignals.jl#54 has been made without GPU considerations in mind. CUDA.jl support for sparse matrices is still at an early stage. The only GPU implementation that we can provide right now is through the COO format; it is already there and working very well in […]. My point is, we can have a 10x or greater speedup on CPU message passing, CUDA support, and the possibility to switch to any other backend in the future if we want to, without having to tag other breaking releases.
I think the key point here should be this sentence. First, you reduce the possibility of inner implementations for […].
So, the changes in yuehhua/GraphSignals.jl#54 are just a step towards a single graph type, in the same direction of simplicity and maintainability.
Thus, we go into the next stage: to decide which graph type has the best performance.
Since the […].
Just show me how you benchmarked this, and which one you compared to which.
How can you tell that there were no GPU considerations in mind?
Which is not true. Currently, CUSPARSE supports many formats, including […].
I am confused about why you referenced an issue. An issue doesn't provide any implementation, let alone show that it works well.
This is what I did in #204, based on the COO implementation, but you strongly opposed it. Then I created this PR so that different implementations could be supported and compared. I have said it many times already: the only GPU-compliant implementation we can have right now is COO. This is why we should either switch entirely to that or allow for a few different backends.
In order to compare, we need a PR like this one that allows for different backends; otherwise each comparison will require a major rewrite. With yuehhua/GraphSignals.jl#54, making comparisons becomes incredibly complicated.
In order to benchmark this, we need the other changes from #204. I have another branch with the changes to be made on top of this PR, but since this PR is not getting merged I don't know what to do with them. You can try to check out #204 and adapt the benchmark script to that PR. I'm very frustrated by the fact that you chose to do yuehhua/GraphSignals.jl#54 not months ago, not in a few months, but just now, nullifying a lot of my work. This is not something that makes one feel welcome to contribute to an open source project.
It contains a lot of scalar indexing. Try adding GPU tests for each of the methods, even basic ones such as […].
As you can read by yourself https://github.com/JuliaGPU/CUDA.jl/blob/fb7440fb568607de4318df4b2e26ad3bd73cf851/lib/cusparse/array.jl#L115
Sorry, I meant #204.
Sorry for making you feel frustrated about contributing to the project. It began with #204, which introduced too much redesign and was hard to review. Most importantly, the redesign collapsed the backbone of this project, and I had already hinted that you should check the paper first, but that didn't work. Secondly, we discussed new designs in #200, #201 and #202, and I agreed with some points, but not all. Still, you insisted on going your own way, so I was forced to review code containing parts I disagreed with, which did cost my time. Obviously, we didn't reach a consensus. However, I am still glad to have these discussions, which clarify our minds and help us make some decisions. Not every PR gets merged, and I gave reasons as well. These issues and PRs are not worthless. Sorry for not accepting your PR. I think I will end here.
It's fine to have diverging opinions, and since you are the author of the package, the ultimate responsibility for steering its direction is yours. I feel there are some major design issues with the current situation that I wasn't given the opportunity to address. My interest is for the Julia ecosystem to have a simple and performant graph neural network library. Since in my opinion GeometricFlux.jl doesn't meet these requirements, nor is willing to, I will consider forking or creating a new GNN package. That said, even though we have different design opinions, I highly appreciate the work you have done here.
This is the part of #204 concerning only the FeaturedGraph implementation.
With respect to #204, and similarly to the current design, the implementation in this PR allows different underlying graph representations. In particular we have:

- `:coo`: the representation expected to be the most performant for message-passing schemes. The benefits of this representation are not visible in this PR, but once other parts of COO FeaturedGraph #204 get merged I expect an order-of-magnitude increase in the performance of message-passing layers (plus we would have CUDA support for all layers, which is currently broken).
- `:dense`: the graph is represented as a dense adjacency matrix. This could be a good representation for small graphs in layers performing algebraic operations, but it doesn't scale well to large graphs. CUDA support for all layers will be feasible once we get other parts of COO FeaturedGraph #204 in.
- `:sparse`: a representation in terms of a sparse adjacency matrix. It is expected to be the most performant implementation for algebraic layers, and also for some of the message-passing layers, since it avoids materializing the node features on each edge. It scales well to large graphs. Unfortunately, CUDA support will be very limited for quite some time.

I looked a bit into CuSparseMatrixCSC, which is what you get when you do […].
Unfortunately its array interface is very limited (e.g. broadcasting is not supported), so I fear we won't be able to rely on it for a long time. On the other hand, CPU support is already very good with this PR: in the tests I get a single error, when trying to compute the eigenvalues of the Laplacian, which I will probably be able to fix in this PR.
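On the CPU side, converting between the `:sparse` and `:coo` representations is cheap with the stdlib SparseArrays; a hedged sketch (not the PR's code):

```julia
using SparseArrays

# Sparse adjacency matrix for edges 1→2, 2→3, 3→1.
s, t = [1, 2, 3], [2, 3, 1]
A = sparse(s, t, ones(Int, 3), 3, 3)

# findnz recovers the COO triplets (rows, cols, vals) from CSC storage,
# so a :sparse graph can be turned into a :coo one without densifying.
rows, cols, vals = findnz(A)

# Round-trip back to CSC.
B = sparse(rows, cols, vals, 3, 3)
```

This kind of round trip is what would let the two backends interoperate while the CUDA sparse story matures.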
Two important differences with respect to the current `FeaturedGraph`: […] and the `graph_type` constructor keyword.

After this PR gets in, we should do a careful performance analysis (see #214) and define specialized methods selecting the most performant path for each graph/layer combination.
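One way such specialized methods could look, using multiple dispatch to select a per-representation path (types and names here are hypothetical, not this PR's API):

```julia
using SparseArrays

abstract type AbstractGraphRep end

struct COOGraph <: AbstractGraphRep
    s::Vector{Int}   # edge sources
    t::Vector{Int}   # edge targets
end

struct SparseGraph <: AbstractGraphRep   # name chosen for illustration only
    A::SparseMatrixCSC{Float64, Int}
end

# Algebraic layers can take the sparse-matrix path...
propagate(g::SparseGraph, X::Matrix) = X * g.A

# ...while message-passing layers take the gather/scatter path over COO edges.
function propagate(g::COOGraph, X::Matrix)
    out = zeros(size(X))
    for (i, j) in zip(g.s, g.t)
        out[:, j] .+= X[:, i]
    end
    return out
end
```

Both methods compute the same neighborhood sum; the benchmark analysis would decide which path each layer defaults to.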
A byproduct of this PR is that it brings back GPU support for some graph/layer combinations (a partial fix to #195).
This is the output of a performance script, similar to #214, that I ran on this PR. I added an extra column, `time_fg`, timing the featured-graph creation.

I hope this can address some of the concerns expressed by @yuehhua in #204.
Fix #200, Fix #217, Fix https://github.com/yuehhua/GraphSignals.jl/issues/50