
CUDA optimized ASSET #351

Merged
merged 21 commits into NeuralEnsemble:master from INM-6:cuda/asset on Jan 29, 2021

Conversation

dizcza
Member

@dizcza dizcza commented Sep 10, 2020

Requirements

  • Exactly one Nvidia GPU card. Multi-GPU setups are currently not supported: if you have multiple GPUs, set the environment variable CUDA_VISIBLE_DEVICES=0.
  • CUDA Toolkit installed (the nvcc compiler command is used).

The CPU <-> CUDA intercommunication is currently naive: log_du is stored in a temporary file, the CUDA code is compiled and run, and the resulting file is read back in Python (see the sketch below). We can think of a more Pythonic approach in the future.
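For illustration only, a minimal sketch of this temp-file round trip, assuming a kernel source asset.cu; the file names and the binary name are placeholders, not the actual ones used in this PR:

import os
import subprocess
import tempfile

import numpy as np

def compute_on_gpu(log_du):
    # Hypothetical sketch of the CPU <-> CUDA handshake described above:
    # dump the input array, compile the kernel with nvcc, run the binary,
    # and read the result back into numpy.
    with tempfile.TemporaryDirectory() as folder:
        input_path = os.path.join(folder, "log_du.dat")
        output_path = os.path.join(folder, "P_total.dat")
        binary = os.path.join(folder, "asset_cuda")
        log_du.astype(np.float64).tofile(input_path)
        subprocess.run(["nvcc", "-o", binary, "asset.cu"], check=True)
        subprocess.run([binary, input_path, output_path], check=True)
        return np.fromfile(output_path, dtype=np.float64)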

Benchmarking

You can run this code in Google Colab (make sure to select the GPU backend: Runtime -> Change runtime type).

import os

import numpy as np

from elephant.asset.asset import _JSFUniformOrderStat3D

L = 100
N = 50
D = 7

# L x D matrix of values uniformly spread in [0, 1]
u = np.arange(L * D, dtype=np.float32).reshape((-1, D))
u /= np.max(u)

jsf = _JSFUniformOrderStat3D(n=N, d=D, verbose=True)

print(jsf.num_iterations)  # 2.2e8

os.environ['ELEPHANT_USE_CUDA'] = '1'  # enable the CUDA backend
%timeit -r1 -n1 jsf.compute(u)         # 10 sec
os.environ['ELEPHANT_USE_CUDA'] = '0'  # fall back to the CPU implementation
%timeit -r1 -n1 jsf.compute(u)         # 5 hours

The approximate speedup is x1000.

One can also play with the cuda_threads optional argument (64 by default) of the ASSET.joint_probability_matrix function, as sketched below.
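A hedged usage sketch: only cuda_threads (and the precision discussed below) are confirmed by this PR; the remaining arguments follow the ASSET API of that time and should be treated as assumptions:

# asset_obj is an instance of elephant.asset.asset.ASSET;
# pmat is the probability matrix computed beforehand.
jmat = asset_obj.joint_probability_matrix(
    pmat,
    filter_shape=(5, 1),  # (length, width) of the filtering kernel
    n_largest=3,          # number of largest neighbors to consider
    precision='double',   # safer than 'float'; see the comparison below
    cuda_threads=128,     # CUDA threads per block; 64 by default
)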

Floating-point tolerance error

Floating-point arithmetic does not necessarily obey the associative rule:

(a + b) + c == a + (b + c)
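A two-line demonstration in single precision (the particular values are chosen only to make the rounding visible; any operands of very different magnitudes would do):

import numpy as np

a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- c is absorbed by the large magnitude of b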

The order in which the floats are added thus impacts the result. MPI is not an exception (see the issue that refers to the case L=2, N=14, D=13; float). Therefore, to be safe, double precision should always be used instead of the previously used single precision (double also deviates, but the impact is less severe). Below is a comparison of the different backends using the benchmark code above.

                           single process   MPI          CUDA
L=2, N=14, D=13; float     0.945907         0.947636     0.947983
L=2, N=14, D=13; double    0.947984         0.947984     0.947983
L=2, N=250, D=3; float     0.99999810       0.99999934   0.999834
L=2, N=250, D=3; double    1.0              1.0          1.000007

Change in logic

The logic is unchanged, except for the _num_iterations function (the matrix is created in a different way); the tests prove that the behavior has not changed. The default precision of the jmat matrix is changed from float to double.

Alternatives

An alternative would be to exploit the fact that the computation is performed on the neighbors of a matrix cell, so convolution operations could take its place (illustrated below).
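To illustrate the idea only (this is not part of the PR): a neighbor-based update of every matrix cell can be expressed as a single vectorized convolution, here with scipy:

import numpy as np
from scipy.ndimage import convolve

mat = np.random.rand(10, 10)
kernel = np.ones((3, 3))  # 3x3 neighborhood of each cell
# one vectorized call instead of a Python loop over the cells
neighbor_sums = convolve(mat, kernel, mode='constant')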

Testing and code coverage

The CUDA implementation is not tested on Travis, so we'll see a small drop in the test coverage. CircleCI provides GPU capabilities (a paid account is required).


Elephant-wise CUDA support

I was thinking about how to accelerate Elephant to utilize the GPU for linear algebra (matrix multiplication, addition, element-wise multiplication; solvers) seamlessly for the user. I see the future in PyTorch: there is an ongoing PR in PyTorch that adds a numpy-equivalent API:

import torch.np as np

This single line could yield a huge speedup for any module in Elephant that uses linear algebra, assuming that we handle the numpy -> torch and torch -> numpy data transfers at the beginning and the end of the function body (sketched below).
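A minimal sketch of that boundary handling with standard PyTorch (torch.from_numpy and Tensor.numpy() are real PyTorch calls; the decorator itself is illustrative, not Elephant code):

import numpy as np
import torch

def torch_accelerated(func):
    # move the input to the GPU, run the torch computation, and move the
    # result back to numpy -- transparently for the caller
    def wrapper(array):
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        tensor = torch.from_numpy(array).to(device)  # numpy -> torch
        result = func(tensor)
        return result.cpu().numpy()                  # torch -> numpy
    return wrapper

@torch_accelerated
def gram_matrix(x):
    return x @ x.T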

@coveralls
Collaborator

coveralls commented Sep 10, 2020

Coverage Status

Coverage decreased (-0.7%) to 89.609% when pulling d2ca4b1 on INM-6:cuda/asset into 0be27f9 on NeuralEnsemble:master.

Comment on lines +468 to +481
def _next_sequence_sorted(self, iteration):
    # an alternative implementation to naive for-loop iteration when the
    # MPI size is large. However, it's not clear under which circumstances,
    # if any, there is a benefit. That's why this function is not used.
    sequence_sorted = []
    element = self.n - 1
    for row in range(self.d - 1, -1, -1):
        map_row = self.map_iterations[row]
        while element > row and iteration < map_row[element]:
            element -= 1
        iteration -= map_row[element]
        sequence_sorted.append(element + 1)
    return tuple(sequence_sorted)

Contributor

I guess you're planning to remove this?

Member Author


I've included this function as a one-to-one correspondence with https://github.com/INM-6/elephant/blob/cuda/asset/elephant/asset/asset.template.cu#L85 and showed in the tests how one can use it. Two reasons to keep it:

  1. It helps to understand the CUDA code.
  2. The Heat framework could utilize this method in their Heat-accelerated ASSET, if their framework allows users to operate directly with CUDA kernels, which require the iterations to be independent by nature.
    I spent two weeks agonizing over whether ASSET could be rewritten to get rid of its sequential processing, which ruins CUDA acceleration. This function proves the idea is correct (see the sketch below).
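A hedged sketch of the independence property, reusing _JSFUniformOrderStat3D from the benchmark above (whether the tests exercise _next_sequence_sorted in exactly this way is an assumption):

from elephant.asset.asset import _JSFUniformOrderStat3D

jsf = _JSFUniformOrderStat3D(n=5, d=3)
# each iteration index maps to its sorted sequence independently of the
# others -- the property that makes a thread-per-iteration CUDA layout possible
for iteration in range(jsf.num_iterations):
    print(iteration, jsf._next_sequence_sorted(iteration))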

@dizcza dizcza merged commit 1e5e33c into NeuralEnsemble:master Jan 29, 2021
@dizcza dizcza deleted the cuda/asset branch January 29, 2021 15:41
ackurth pushed a commit to INM-6/elephant that referenced this pull request Oct 1, 2021