
CUDA optimized ASSET #351

Merged
merged 21 commits into NeuralEnsemble:master from INM-6:cuda/asset on Jan 29, 2021

Conversation

dizcza
Member

@dizcza dizcza commented Sep 10, 2020

Requirements

  • Exactly one Nvidia GPU card. Multi-GPU setups are currently not supported: if you have multiple GPUs, set the environment variable CUDA_VISIBLE_DEVICES=0.
  • CUDA Toolkit installed (the nvcc compiler command is used).

The CPU <-> CUDA intercommunication is currently naive: log_du is stored in a temporary file, the CUDA code is compiled and run, and the resulting file is read back in Python (see the sketch below). We can think of a more Pythonic approach in the future.
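For illustration only, a minimal sketch of this temp-file round trip, assuming a kernel source asset.cu; the file names and the binary name are placeholders, not the actual ones used in this PR:

import os
import subprocess
import tempfile

import numpy as np

def compute_on_gpu(log_du):
    # Hypothetical sketch of the CPU <-> CUDA handshake described above:
    # dump the input array, compile the kernel with nvcc, run the binary,
    # and read the result back into numpy.
    with tempfile.TemporaryDirectory() as folder:
        input_path = os.path.join(folder, "log_du.dat")
        output_path = os.path.join(folder, "P_total.dat")
        binary = os.path.join(folder, "asset_cuda")
        log_du.astype(np.float64).tofile(input_path)
        subprocess.run(["nvcc", "-o", binary, "asset.cu"], check=True)
        subprocess.run([binary, input_path, output_path], check=True)
        return np.fromfile(output_path, dtype=np.float64)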

Benchmarking

You can run this code in Google Colab (make sure to select the GPU backend: Runtime -> Change runtime type).

import os

import numpy as np

from elephant.asset.asset import _JSFUniformOrderStat3D

L = 100
N = 50
D = 7

# L x D matrix of values uniformly spread in [0, 1]
u = np.arange(L * D, dtype=np.float32).reshape((-1, D))
u /= np.max(u)

jsf = _JSFUniformOrderStat3D(n=N, d=D, verbose=True)

print(jsf.num_iterations)  # 2.2e8

os.environ['ELEPHANT_USE_CUDA'] = '1'  # enable the CUDA backend
%timeit -r1 -n1 jsf.compute(u)         # 10 sec
os.environ['ELEPHANT_USE_CUDA'] = '0'  # fall back to the CPU implementation
%timeit -r1 -n1 jsf.compute(u)         # 5 hours

The approximate speedup is x1000.

One can also play with the cuda_threads optional argument (64 by default) of the ASSET.joint_probability_matrix function, as sketched below.
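A hedged usage sketch: only cuda_threads (and the precision discussed below) are confirmed by this PR; the remaining arguments follow the ASSET API of that time and should be treated as assumptions:

# asset_obj is an instance of elephant.asset.asset.ASSET;
# pmat is the probability matrix computed beforehand.
jmat = asset_obj.joint_probability_matrix(
    pmat,
    filter_shape=(5, 1),  # (length, width) of the filtering kernel
    n_largest=3,          # number of largest neighbors to consider
    precision='double',   # safer than 'float'; see the comparison below
    cuda_threads=128,     # CUDA threads per block; 64 by default
)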

Floating-point tolerance error

Floating-point arithmetic does not necessarily obey the associative rule:

(a + b) + c == a + (b + c)
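A two-line demonstration in single precision (the particular values are chosen only to make the rounding visible; any operands of very different magnitudes would do):

import numpy as np

a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- c is absorbed by the large magnitude of b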

The order in which the floats are added thus impacts the result. MPI is not an exception (see the issue that refers to the case L=2, N=14, D=13; float). Therefore, to be safe, double precision should always be used instead of the previously used single precision (double also deviates, but the impact is less severe). Below is a comparison of the different backends using the benchmark code above.

                           single process   MPI          CUDA
L=2, N=14, D=13; float     0.945907         0.947636     0.947983
L=2, N=14, D=13; double    0.947984         0.947984     0.947983
L=2, N=250, D=3; float     0.99999810       0.99999934   0.999834
L=2, N=250, D=3; double    1.0              1.0          1.000007

Change in logic

The logic is unchanged, except for the _num_iterations function (the matrix is created in a different way); the tests prove that the behavior has not changed. The default precision of the jmat matrix is changed from float to double.

Alternatives

An alternative would be to exploit the fact that the computation is performed on the neighbors of a matrix cell, so convolution operations could take its place (illustrated below).
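To illustrate the idea only (this is not part of the PR): a neighbor-based update of every matrix cell can be expressed as a single vectorized convolution, here with scipy:

import numpy as np
from scipy.ndimage import convolve

mat = np.random.rand(10, 10)
kernel = np.ones((3, 3))  # 3x3 neighborhood of each cell
# one vectorized call instead of a Python loop over the cells
neighbor_sums = convolve(mat, kernel, mode='constant')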

Testing and code coverage

The CUDA implementation is not tested on Travis, so we'll see a small drop in the test coverage. CircleCI provides GPU capabilities (a paid account is required).


Elephant-wise CUDA support

I was thinking about how to accelerate Elephant to utilize the GPU for linear algebra (matrix multiplication, addition, element-wise multiplication; solvers) seamlessly for the user. I see the future in PyTorch: there is an ongoing PR in PyTorch that adds a numpy-equivalent API:

import torch.np as np

This single line could yield a huge speedup for any module in Elephant that uses linear algebra, assuming that we handle the numpy -> torch and torch -> numpy data transfers at the beginning and the end of the function body (sketched below).
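A minimal sketch of that boundary handling with standard PyTorch (torch.from_numpy and Tensor.numpy() are real PyTorch calls; the decorator itself is illustrative, not Elephant code):

import numpy as np
import torch

def torch_accelerated(func):
    # move the input to the GPU, run the torch computation, and move the
    # result back to numpy -- transparently for the caller
    def wrapper(array):
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        tensor = torch.from_numpy(array).to(device)  # numpy -> torch
        result = func(tensor)
        return result.cpu().numpy()                  # torch -> numpy
    return wrapper

@torch_accelerated
def gram_matrix(x):
    return x @ x.T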

@coveralls
Collaborator

coveralls commented Sep 10, 2020

Coverage Status

Coverage decreased (-0.7%) to 89.609% when pulling d2ca4b1 on INM-6:cuda/asset into 0be27f9 on NeuralEnsemble:master.

Comment on lines +468 to +481
def _next_sequence_sorted(self, iteration):
    # an alternative implementation to naive for-loop iteration when the
    # MPI size is large. However, it's not clear under which circumstances,
    # if any, there is a benefit. That's why this function is not used.
    sequence_sorted = []
    element = self.n - 1
    for row in range(self.d - 1, -1, -1):
        map_row = self.map_iterations[row]
        while element > row and iteration < map_row[element]:
            element -= 1
        iteration -= map_row[element]
        sequence_sorted.append(element + 1)
    return tuple(sequence_sorted)

Contributor

I guess you're planning to remove this?

Member Author


I've included this function as a one-to-one correspondence with https://github.com/INM-6/elephant/blob/cuda/asset/elephant/asset/asset.template.cu#L85 and showed in the tests how one can use it. Two reasons to keep it:

  1. It helps to understand the CUDA code.
  2. The Heat framework could utilize this method in their Heat-accelerated ASSET, if their framework allows users to operate directly with CUDA kernels, which require the iterations to be independent by nature.
    I spent two weeks agonizing over whether ASSET could be rewritten to get rid of its sequential processing, which ruins CUDA acceleration. This function proves the idea is correct (see the sketch below).
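A hedged sketch of the independence property, reusing _JSFUniformOrderStat3D from the benchmark above (whether the tests exercise _next_sequence_sorted in exactly this way is an assumption):

from elephant.asset.asset import _JSFUniformOrderStat3D

jsf = _JSFUniformOrderStat3D(n=5, d=3)
# each iteration index maps to its sorted sequence independently of the
# others -- the property that makes a thread-per-iteration CUDA layout possible
for iteration in range(jsf.num_iterations):
    print(iteration, jsf._next_sequence_sorted(iteration))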

@dizcza dizcza merged commit 1e5e33c into NeuralEnsemble:master Jan 29, 2021
@dizcza dizcza deleted the cuda/asset branch January 29, 2021 15:41
ackurth pushed a commit to INM-6/elephant that referenced this pull request Oct 1, 2021