
PyCUDA and PyOpenCL backends for ASSET joint prob. matrix calculation #404

Merged

merged 52 commits into NeuralEnsemble:master from INM-6:cuda/asset on Feb 25, 2021

Conversation

@dizcza (Member) commented on Feb 16, 2021

Benchmarks (speedup over the original Python implementation):

  • PyCUDA: 1000x and more
  • PyOpenCL: 100x and more

Changes to _JSFUniformOrderStat3D class:

  1. Changed the default precision from double to float. I find that floats perform about as accurately as doubles on my built-in Intel graphics card, yet floats are ~4x faster than doubles for both backends. In either case, users can easily change the precision manually at any time.
  2. Added a pycuda() backend that copies arrays from RAM (host Python memory) directly to CUDA global memory.
  3. Added a pyopencl() backend, suited for laptops with a built-in Intel GPU. As with the PyCUDA backend, it copies arrays directly to (Intel) GPU global memory. A sketch of the host-to-device copy pattern both backends rely on is shown after this list.
  4. Renamed the cuda() backend to _cuda(). This backend was the breakthrough in accelerating ASSET with CUDA, but it suffers from disk I/O overhead. The _cuda() backend and the joint_pmat_old.cu file should be removed once the PyCUDA backend is no longer an experimental feature.
  5. Added a watchdog that barks when the computed values of the joint prob. matrix fall outside the valid [0, 1] interval (see the second sketch after this list). For this reason, a tolerance parameter is added to the joint_probability_matrix function.
  6. Added a description of how to install CUDA and OpenCL support to the Elephant installation documentation. In particular, if the PyOpenCL backend is used, users need to disable GPU Hangcheck (described in install.rst) to prevent kernel resets when a computation takes a long time to finish. Doing so, of course, makes the system unresponsive until the compute program terminates.
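Here is a minimal sketch of the host-to-device copy pattern that items 2 and 3 refer to (array names are illustrative, not Elephant's actual code; it assumes pycuda and pyopencl are installed with working drivers):

```python
import numpy as np

# --- PyCUDA: copy a host array straight into CUDA global memory ---
import pycuda.autoinit              # creates a default CUDA context
import pycuda.gpuarray as gpuarray

host_arr = np.random.rand(1024).astype(np.float32)  # float is the new default precision
dev_arr = gpuarray.to_gpu(host_arr)                 # RAM -> CUDA global memory
result_cuda = dev_arr.get()                         # device -> host copy back

# --- PyOpenCL: the same pattern for a built-in (Intel) GPU ---
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()                      # pick an available OpenCL device
queue = cl.CommandQueue(ctx)
dev_arr_cl = cl_array.to_device(queue, host_arr)    # RAM -> GPU global memory
result_ocl = dev_arr_cl.get()                       # device -> host copy back
```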
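The watchdog in item 5 amounts to a range check with a tolerance. A hedged sketch, with illustrative names (not Elephant's exact code):

```python
import warnings

import numpy as np

def check_joint_pmat(jmat, tolerance=1e-5):
    # Warn when values of the joint prob. matrix escape [0, 1] by more
    # than the tolerance - a hint to switch back to double precision.
    if np.min(jmat) < -tolerance or np.max(jmat) > 1 + tolerance:
        warnings.warn("Joint prob. matrix values fall outside of [0, 1]; "
                      "consider setting precision='double'.")
```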

Changes to _pmat_neighbors function:

  1. Accelerated with CUDA and OpenCL; rewrote the function as the _PMatNeighbors class to facilitate different backends. It became thousands of times faster; the exact speedup doesn't matter - it runs in the blink of an eye.

Changes to other Python functions:

  1. Optimized du = np.diff(u) in both memory and speed (see the compute function).
  2. Reduced the memory footprint of the cluster_matrix_entries function ~5x with chunking; a sketch of the idea follows the table below. The chunk size is controlled by the working_memory parameter. With working_memory set to 100, the peak memory allocation (of cluster_matrix_entries and, therefore, of ASSET itself, since it was the most memory-consuming part of ASSET after the pmat bug was fixed in "Memory efficient and faster implementation of ASSET pmat analytical" #399) is reduced as follows, compared to the master branch:
| mmat.shape | No chunking, MB | Chunked, MB |
|------------|-----------------|-------------|
| (150, 150) | 4300            | 815         |
| (170, 170) | 7600            | 1440        |
| (200, 200) | MemoryError     | 3000        |
| (250, 250) | MemoryError     | 8000        |
| (270, 270) | MemoryError     | 11200       |
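A hedged sketch of the chunking idea (illustrative only, not Elephant's exact implementation): instead of materializing the full pairwise matrix at once, process it in row blocks whose size is derived from working_memory in MB:

```python
import numpy as np

def chunked_pairwise_distances(points, working_memory=100):
    # Yield pairwise-distance blocks of `points` (shape n x d) row by row,
    # bounding each block's working set near `working_memory` MB instead of
    # allocating the full n x n intermediate at once.
    n, d = points.shape
    # the broadcasted intermediate dominates: rows_per_chunk * n * d entries
    bytes_per_row = n * d * points.itemsize
    rows_per_chunk = max(1, int(working_memory * 2 ** 20 // bytes_per_row))
    for start in range(0, n, rows_per_chunk):
        stop = min(start + rows_per_chunk, n)
        block = np.linalg.norm(points[start:stop, None] - points[None, :],
                               axis=-1)
        yield start, block
```

With working_memory=100, each block stays near 100 MB, trading one huge allocation for several bounded ones - that's what turns the MemoryError rows in the table above into finite numbers.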

It's also possible to install pyopencl with pip if you've installed the Intel GPU driver manually. You need to make sure that pyopencl sees your system-wide libOpenCL.so*: first find the location of libOpenCL.so, then provide its directory path via the LD_LIBRARY_PATH environment variable. Here are the commands I used:

```sh
$ ldconfig -p | grep OpenCL
        libOpenCL.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
$ LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu pip install pyopencl
```

I didn't put this into install.rst because it's advanced stuff, and the conda-forge route works reliably.


The CUDA part is not tested on CI because Travis does not provide GPUs. Here are the steps to test it manually on Google Colab:

  1. Runtime -> Change runtime type -> GPU
  2. pip install -e git+https://github.com/INM-6/elephant#egg=elephant[extras,cuda]
  3. Restart the kernel.
  4. python /content/src/elephant/elephant/test/test_asset.py

@coveralls (Collaborator) commented on Feb 16, 2021

Coverage Status

Coverage decreased (-0.9%) to 88.749% when pulling abe146a on INM-6:cuda/asset into 5e95f77 on NeuralEnsemble:master.

@pep8speaks commented on Feb 19, 2021

Hello @dizcza! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-02-22 13:48:36 UTC

@dizcza merged commit e56b1ac into NeuralEnsemble:master on Feb 25, 2021
@dizcza deleted the cuda/asset branch on February 25, 2021 at 14:59