Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: convolve with distributed kernel on multiple GPUs #1085

Closed
mtar opened this issue Feb 2, 2023 · 3 comments · Fixed by #1095
Closed

[Bug]: convolve with distributed kernel on multiple GPUs #1085

mtar opened this issue Feb 2, 2023 · 3 comments · Fixed by #1095

Comments

@mtar
Copy link
Collaborator

mtar commented Feb 2, 2023

What happened?

convolve does not work if the kernel is distributed when more than one GPU is available.

Code snippet triggering the error

import heat as ht

dis_signal = ht.arange(0, 16, split=0, device='gpu', dtype=ht.int)
dis_kernel_odd = ht.ones(3, split=0, dtype=ht.int, device='gpu')
conv = ht.convolve(dis_signal, dis_kernel_odd, mode='full')

Error message or erroneous outcome

$ CUDA_VISIBLE_DEVICES=0,1,2,3 srun --ntasks=2 -l python test.py 
1:Traceback (most recent call last):
1:   File ".../test.py", line 7, in <module>
1:     conv = ht.convolve(dis_signal, dis_kernel_odd, mode='full')
1:   File ".../heat-venv_2023/lib/python3.10/site-packages/heat/core/signal.py", line 161, in convolve
1:     local_signal_filtered = fc.conv1d(signal, t_v1)
1: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)

Version

main (development branch)

Python version

3.10

PyTorch version

1.12

MPI version

OpenMPI 4.1.4
@mtar mtar added bug Something isn't working devices signal processing signal redistribution Related to distributed tensors communication and removed devices redistribution Related to distributed tensors labels Feb 2, 2023
@mtar
Copy link
Collaborator Author

mtar commented Feb 3, 2023

rec_v = v.comm.bcast(t_v, root=r)

This is causing the issue. Unlike our own implementation, it keeps the device between processes.

@shahpratham
Copy link
Collaborator

@mtar should I use 'Bcast' then, as mentioned here in #790?

@mtar
Copy link
Collaborator Author

mtar commented Feb 6, 2023

@mtar should I use 'Bcast' then, as mentioned here in #790?

Yes, that should fix it.

mtar added a commit that referenced this issue Feb 27, 2023
* changed to Bcast

* clone tensor before Bcast

---------

Co-authored-by: Michael Tarnawa <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants