CUDA optimized ASSET #351
Conversation
```python
def _next_sequence_sorted(self, iteration):
    # an alternative implementation to naive for-loop iteration when the
    # MPI size is large. However, it's not clear under which circumstances,
    # if any, there is a benefit. That's why this function is not used.
    sequence_sorted = []
    element = self.n - 1
    for row in range(self.d - 1, -1, -1):
        map_row = self.map_iterations[row]
        while element > row and iteration < map_row[element]:
            element -= 1
        iteration -= map_row[element]
        sequence_sorted.append(element + 1)
    return tuple(sequence_sorted)
```
I guess you're planning to remove this?
I've included this function as a one-to-one correspondence with https://github.com/INM-6/elephant/blob/cuda/asset/elephant/asset/asset.template.cu#L85 and shown in the tests how one can use it. Two reasons to keep it:
- It helps to understand the CUDA code.
- The Heat framework could utilize this method in their Heat-accelerated ASSET, if their framework allows users to operate directly with CUDA kernels, which by nature require iterations to be independent.

I've spent two weeks agonizing over whether ASSET could be rewritten to get rid of its sequential processing, which ruins CUDA acceleration. This function proves the idea is correct.
Requirements
- The `CUDA_VISIBLE_DEVICES=0` environment variable is set.
- CUDA is installed (the `nvcc` compiler command is used).

The intercommunication CPU <-> CUDA is currently naive: store `log_du` in a temp file, compile and run CUDA, and read the resulting file back in Python. We can think of a more pythonic approach in the future; a rough sketch of this flow is shown below.
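A minimal sketch of how this temp-file handshake could look, assuming a standalone kernel source (here `asset.template.cu`) that takes the input and output file paths as command-line arguments; the file names and argument convention are illustrative, not the actual implementation:

```python
import subprocess
import tempfile
from pathlib import Path

import numpy as np


def run_cuda_asset(log_du, kernel_source="asset.template.cu"):
    """Naive CPU <-> CUDA handshake: dump input, compile, run, read back."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        input_file = tmp / "log_du.txt"
        output_file = tmp / "P_total.txt"
        binary = tmp / "asset_cuda"

        # 1. store log_du in a temporary file
        np.savetxt(input_file, log_du)

        # 2. compile the kernel and run the resulting binary (nvcc on PATH)
        subprocess.run(["nvcc", "-O3", "-o", str(binary), kernel_source],
                       check=True)
        subprocess.run([str(binary), str(input_file), str(output_file)],
                       check=True)

        # 3. read the resulting file back in Python
        return np.loadtxt(output_file)
```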
Benchmarking
You can run this code in Google Colab (make sure to select the GPU backend: Runtime -> Change runtime type).
Approx. speedup is x1000.
One can also play with the `cuda_threads` optional argument (64 by default) of the `ASSET.joint_probability_matrix` function, as in the snippet below.
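A hypothetical call, only to show where the argument goes; `asset_obj`, `pmat`, and the other argument values are placeholders from the usual ASSET workflow, not part of this PR:

```python
# Assumed workflow: `asset_obj` is an instantiated ASSET object and `pmat`
# its probability matrix; only `cuda_threads` is the point of this snippet.
jmat = asset_obj.joint_probability_matrix(
    pmat,
    filter_shape=(5, 1),
    n_largest=3,
    cuda_threads=128,  # default is 64
)
```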
Floating-point tolerance error
Floating-point arithmetic does not necessarily obey the associative rule: `(a + b) + c` is not guaranteed to equal `a + (b + c)` (see the snippet below). In other words, the order of adding the floats impacts the result. MPI is not an exception (see the issue that refers to the case `L=2, N=14, D=13; float`). Therefore, to be safe, double precision should always be used instead of the previously used float (even double deviates, but the impact is less severe). Below is a comparison of different backends using the benchmark code above.
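A quick, self-contained illustration of the non-associativity (plain Python, not ASSET-specific):

```python
# The grouping of the additions changes the double-precision result:
print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
print(0.1 + (0.2 + 0.3))   # 0.6
```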
Change in logic
The logic is not changed, except for the `_num_iterations` function (the matrix is created in a different way). The tests prove that the behavior has not changed. The default precision of the `jmat` matrix is changed from float to double.

Alternatives
An alternative would be to exploit the fact that the computation is performed on the neighbors of a matrix cell; therefore, convolution operations could take place, roughly as sketched below.
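A speculative sketch of that direction, assuming the per-cell work can be phrased as a sum over a 3x3 neighborhood; the kernel shape and boundary handling are placeholders, not the actual ASSET computation:

```python
import numpy as np
from scipy.signal import convolve2d

mat = np.random.rand(200, 200)

# 3x3 kernel that sums the 8 neighbors of every cell (the cell itself excluded)
kernel = np.ones((3, 3))
kernel[1, 1] = 0

# one vectorized convolution instead of an explicit loop over cells
neighbor_sums = convolve2d(mat, kernel, mode="same", boundary="fill", fillvalue=0)
```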
Testing and code coverage
The CUDA implementation is not tested on Travis, so we'll have a small drop in the test coverage. CircleCI provides GPU capabilities (a paid account is required).
Elephant-wise CUDA support
I was thinking about how to accelerate Elephant to utilize the GPU for linear algebra (matrix multiplication, addition; solvers) seamlessly for the user. I see the future in PyTorch: there is an ongoing PR in PyTorch that adds a NumPy-equivalent API. Such a simple change could bring a huge speedup to any module in Elephant that uses linear algebra, assuming that we handle the numpy -> torch and torch -> numpy data transfers at the beginning and the end of the function body.
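A hedged sketch of what that could look like for a single function, with the numpy -> torch and torch -> numpy transfers at the function boundaries; the function name and shapes are made up for illustration:

```python
import numpy as np
import torch


def heavy_linear_algebra(a, b):
    # numpy -> torch transfer at the beginning of the function body
    a_t = torch.as_tensor(a)
    b_t = torch.as_tensor(b)
    if torch.cuda.is_available():
        a_t, b_t = a_t.cuda(), b_t.cuda()

    # the linear algebra itself is unchanged; it runs on the GPU if one is present
    c_t = a_t @ b_t + a_t

    # torch -> numpy transfer at the end of the function body
    return c_t.cpu().numpy()


result = heavy_linear_algebra(np.ones((1000, 1000)), np.ones((1000, 1000)))
```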