Optimization of the graph network #48

Open
5 of 9 tasks
raimis opened this issue Nov 23, 2021 · 14 comments

raimis (Collaborator) commented Nov 23, 2021

Optimization of the graph network (TorchMD_GN) with NNPOps (https://github.com/openmm/NNPOps).

In a special case, TorchMD_GN is equivalent to SchNet (#45 (comment)), which is already supported by NNPOps:

TorchMD_GN(rbf_type="gauss", trainable_rbf=False, activation="ssp", neighbor_embedding=False)

In the general case, TorchMD_GN needs the following:

TorchMD_GN(rbf_type="expnorm", trainable_rbf=True, activation="silu", neighbor_embedding=True)
  • Implement the exponentially-modified Gaussian RBF in CFConv (rbf_type="expnorm"); see the sketch after this list
  • Allow passing arbitrary RBF positions to CFConv (trainable_rbf=True)
  • Implement the SiLU activation in CFConv (activation="silu")
  • Reuse CFConv to accelerate the neighbor embedding (neighbor_embedding=True)
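
For illustration, a minimal sketch of that exponential-normal ("expnorm") basis in plain PyTorch. This is not the NNPOps implementation; the parameter initialisation only loosely follows torchmd-net's ExpNormalSmearing, and with trainable_rbf=True the means and betas would be trainable parameters.

import math
import torch

def expnorm_rbf(dist, cutoff=5.0, num_rbf=50):
    # Gaussians placed on an exponentially decaying transform of the distance
    # (PhysNet-style); dist has shape (num_edges,), output (num_edges, num_rbf).
    alpha = 5.0 / cutoff
    start = math.exp(-cutoff)
    means = torch.linspace(start, 1.0, num_rbf)
    betas = torch.full((num_rbf,), (2.0 / num_rbf * (1.0 - start)) ** -2)
    return torch.exp(-betas * (torch.exp(-alpha * dist.unsqueeze(-1)) - means) ** 2)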
raimis self-assigned this Nov 23, 2021
raimis (Collaborator, Author) commented Nov 24, 2021

Regarding the interface, it should look and work like this:

import torch
from torchmdnet.models.torchmd_gn import TorchMD_GN

# Create or load a model in any way
model = TorchMD_GN()

# Optional: train or do whatever you want with the model

# Optimize the model
from torchmdnet.optimize import optimize
optimized_model = optimize(model, some_optimization_options)

# Do the inference with the model (z, pos and batch are the usual input tensors)
results = optimized_model.forward(z, pos, batch)

# Optional: convert the model into TorchScript and save for external use (e.g. OpenMM-Torch)
torch.jit.script(optimized_model).save('model.pt')

It is similar to what is being implemented for the TorchANI optimization (https://github.com/raimis/NNPOps/blob/opt_ani/README.md#example).

@PhilippThoelke @stefdoerr @giadefa any comments?

raimis (Collaborator, Author) commented Dec 9, 2021

For the moment, it seems all the PyTorch Geometric packages are broken (pyg-team/pytorch_geometric#3660).

raimis (Collaborator, Author) commented Feb 22, 2022

@peastman I have just finished integrating NNPOps (#50). The performance (https://github.com/torchmd/torchmd-net/blob/main/benchmarks/graph_network.ipynb) is just 2-3 times better for the small molecules (10-100 atoms), and there is no significant improvement for the larger ones.

I'll try profiling to get better insight. At some point, we should discuss whether we can make any further improvements.
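
A minimal sketch of how that profiling could be done with torch.profiler (model, z, pos and batch are placeholders, as in the interface example above):

import torch
from torch.profiler import profile, ProfilerActivity

# Profile a few inference steps to see where the time goes
# (kernel launches vs. actual computation).
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(z, pos, batch)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))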

cc: @giadefa

peastman (Collaborator) commented:

It would be useful to separate out all the different optimizations in NNPOps. Can you identify the effect of each one separately?

Back when we first started designing it, we discussed requirements and decided it would be optimized for molecules of about 100 atoms. The code is all designed around that assumption. If we want good performance on much larger molecules, it would need to be written differently. For example, it uses an O(n^2) algorithm to build the neighbor list, which is very fast for small molecules and very slow for large ones.
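
For illustration, a minimal all-pairs neighbor search in plain PyTorch (not the NNPOps kernel), which shows why the cost grows as O(n^2) with the number of atoms:

import torch

def brute_force_neighbors(pos, cutoff):
    # pos: (n, 3) atom positions; returns edge_index of shape (2, num_pairs).
    # The (n, n) distance matrix is cheap for ~100 atoms but grows quadratically.
    dist = torch.cdist(pos, pos)
    mask = (dist < cutoff) & ~torch.eye(len(pos), dtype=torch.bool, device=pos.device)
    return mask.nonzero().T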

giadefa (Contributor) commented Feb 22, 2022 via email

peastman (Collaborator) commented:

That would definitely need code changes to be efficient. You want it to know it only needs to check each atom against the other atoms in its own copy, not all the other copies. Spreading the copies out through space is also inaccurate. The further an atom is from the origin, the less precisely its position can be specified.
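
A minimal sketch of how a batch-aware all-pairs search could restrict pairs to atoms within the same copy, using a per-atom batch index as in PyTorch Geometric (illustrative only, not existing NNPOps code):

import torch

def brute_force_neighbors_batched(pos, batch, cutoff):
    # Only pair atoms that belong to the same copy, so replicas do not
    # need to be spread out through space to avoid spurious neighbors.
    dist = torch.cdist(pos, pos)
    same_copy = batch.unsqueeze(0) == batch.unsqueeze(1)
    self_pairs = torch.eye(len(pos), dtype=torch.bool, device=pos.device)
    mask = (dist < cutoff) & same_copy & ~self_pairs
    return mask.nonzero().T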

giadefa (Contributor) commented Feb 22, 2022 via email

peastman (Collaborator) commented:

It's possible. Can you open an issue on the NNPOps repository describing exactly how you would want it to work?

giadefa (Contributor) commented Feb 23, 2022

@raimis can you make an issue there, as you probably know the details of what you need in NNPOps.

raimis (Collaborator, Author) commented Feb 23, 2022

Just before going into NNPOps, I checked how much CUDA Graphs (https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/) can help.

CUDA Graphs don't work with TorchMD_GN due to rusty1s/pytorch_cluster#123. To circumvent this, I have implemented a fake neighbor search (#60), which assumes that all atoms are neighbors of each other (i.e. brute force).

The results (https://github.com/raimis/torchmd-net/blob/poc_cuda_graph/benchmarks/graph_network.ipynb) are promising:

  • For alanine dipeptide (ALA2, 22 atoms) and testosterone (TST, 49 atoms), the brute-force approach with CUDA Graphs beats everything else.
  • For chignolin (CLN, 166 atoms), brute force is no longer the best and, for larger systems, it runs out of memory.

Ping: @giadefa @peastman @claudi
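
For reference, a minimal sketch of capturing a forward pass with CUDA Graphs via the PyTorch API. This is not the actual #60 implementation; model, z, pos and batch are assumed to be static GPU tensors with fixed shapes, which is why a fixed "everything is a neighbor" search is needed.

import torch

# Warm-up on a side stream before capture, as recommended by the PyTorch docs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(z, pos, batch)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph and replay it for subsequent inferences.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(z, pos, batch)

pos.copy_(new_pos)  # update the static input in place (new_pos is a placeholder)
g.replay()          # static_out now holds the result for the new positions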

giadefa (Contributor) commented Feb 23, 2022 via email

raimis (Collaborator, Author) commented Feb 23, 2022

The current implementation doesn't support batching, but it could be implemented.

peastman (Collaborator) commented:

That's interesting. It tells us that for the smaller molecules, the computation time is just dominated by kernel launch overhead.

raimis (Collaborator, Author) commented Apr 25, 2022

Optimization: round 2

I have written optimized kernels for the neighbor search (#61) and message passing (#69). The kernels are drop-in replacements for the generic kernels from PyTorch Geometric.
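
For context, this is roughly the generic gather/scatter formulation of the message-passing step that such a kernel replaces with a fused implementation (a sketch of the idea, not the actual kernel in #69):

import torch

def cfconv_message_passing(x, edge_index, edge_weight):
    # x: (num_atoms, num_features); edge_index: (2, num_edges);
    # edge_weight: (num_edges, num_features) continuous-filter weights.
    src, dst = edge_index
    messages = x[src] * edge_weight           # per-edge filtered features
    out = torch.zeros_like(x)
    out.index_add_(0, dst, messages)          # aggregate messages onto destination atoms
    return out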

Speed:

  • kernels: uses just the new kernels

  • kernels+graphs: uses the new kernels and CUDA Graphs

  • Other benchmarks are as in the previous plot (#48 (comment))

  • There is a significant speed-up for the small molecules, as even more overhead is removed.

  • For large molecules, the speed is comparable to @peastman's kernels, as the time is dominated by the computation itself.

  • The new kernels fail with STMV, but not due to lack of memory. I still need to debug the cause.

Full details in the notebook: https://github.com/raimis/torchmd-net/blob/poc_cuda_graph_2/benchmarks/graph_network.ipynb
