Large data counts support for MPI Communication #1765
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1765 +/- ##
==========================================
- Coverage 92.26% 92.25% -0.01%
==========================================
Files 84 84
Lines 12447 12463 +16
==========================================
+ Hits 11484 11498 +14
- Misses 963 965 +2
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
I have encountered the following problem:

results in the following error:

With 2 ** 10 in the last entry of shape, there is no problem, so it seems to be related to large counts.
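The original snippet and traceback were not preserved above; the following is a hypothetical reconstruction of such a reproducer. The shape values are assumptions, chosen so the element count reaches 2**31, one past the 2**31 - 1 limit.

# Hypothetical reproducer, not the exact snippet from the comment above.
# 2**10 * 2**10 * 2**11 = 2**31 elements, one more than a 32-bit MPI
# count can describe; with 2**10 as the last entry it fits and works.
import heat as ht
import torch

shape = (2**10, 2**10, 2**11)
data = torch.zeros(shape, dtype=torch.int64)
ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)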
Benchmark results - Sponsored by perun
Grafana Dashboard
Could the problem be that, for all communication involving MPI operations like MPI.SUM, such an operation is not well-defined on the MPI vector construction chosen for the buffers?
Have you found a bug? I don't think it should be an issue: the vector datatype just describes where the data is, where it needs to go, and in what order. As long as both send and receive buffers are well-defined by the datatype, there should not be an issue with MPI operations.
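To make this concrete, here is an illustrative sketch (not code from this PR) of how a vector datatype can describe strided data in place; the array and the numbers are made up for the example.

# Illustrative sketch: an MPI vector datatype describing one column of a
# row-major 4x4 int64 array, without copying it to contiguous memory.
from mpi4py import MPI
import numpy as np

a = np.arange(16, dtype=np.int64).reshape(4, 4)
# 4 blocks of 1 element each, consecutive blocks 4 elements apart:
# exactly the layout of column a[:, 0] inside a's contiguous buffer.
col_type = MPI.INT64_T.Create_vector(4, 1, 4)
col_type.Commit()
# comm.Send([a, 1, col_type], dest=...) would send just that column.
col_type.Free()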
The example with Allreduce I posted above caused an error for me.
Had a look and found out that reduction operations with derived datatypes require custom-made reduction functions to work; the latest commits handle that. It came with a slightly bigger refactoring than expected. After some early and inconclusive tests, it might bring both a runtime and a memory usage improvement. (A sketch of such a custom reduction follows after the script below.)

import heat as ht
import torch
import perun

from heat.core.communication import CUDA_AWARE_MPI

print(f"CUDA_AWARE_MPI: {CUDA_AWARE_MPI}")

world_size = ht.MPI_WORLD.size
rank = ht.MPI_WORLD.rank
print(f"{rank}/{world_size} reporting for duty!")

dtype = torch.int64

@perun.monitor()
def cpu_inplace_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype).reshape(shape)
    ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)
    del data

@perun.monitor()
def cpu_inplace_non_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype).reshape(shape)
    data = data.permute(2, 1, 0)
    ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)
    del data

@perun.monitor()
def cpu_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype).reshape(shape)
    zeros = torch.zeros(shape, dtype=dtype)
    ht.MPI_WORLD.Allreduce(zeros, data, ht.MPI.SUM)
    del data
    del zeros

@perun.monitor()
def cpu_non_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype).reshape(shape)
    data = data.permute(2, 1, 0)
    zeros = torch.zeros(shape, dtype=dtype)
    zeros = zeros.permute(2, 1, 0)
    ht.MPI_WORLD.Allreduce(data, zeros, ht.MPI.SUM)
    del data
    del zeros

@perun.monitor()
def gpu_inplace_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype, device=f"cuda:{rank}").reshape(shape)
    print(f"Rank {rank} working on device {data.device}")
    ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)
    del data

@perun.monitor()
def gpu_inplace_non_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype, device=f"cuda:{rank}").reshape(shape)
    data = data.permute(2, 1, 0)
    ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)
    del data

@perun.monitor()
def gpu_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype, device=f"cuda:{rank}").reshape(shape)
    zeros = torch.zeros(shape, dtype=dtype, device=f"cuda:{rank}")  # same device as data
    ht.MPI_WORLD.Allreduce(data, zeros, ht.MPI.SUM)
    del data
    del zeros

@perun.monitor()
def gpu_non_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype, device=f"cuda:{rank}").reshape(shape)
    data = data.permute(2, 1, 0)
    zeros = torch.zeros(shape, dtype=dtype, device=f"cuda:{rank}")  # same device as data
    zeros = zeros.permute(2, 1, 0)
    ht.MPI_WORLD.Allreduce(data, zeros, ht.MPI.SUM)
    del data
    del zeros

if __name__ == "__main__":
    base_shape = (500, 500, 500 * world_size)
    shape = base_shape
    n_elements = int(torch.prod(torch.tensor(shape)))
    print(f"N Elements: {n_elements}, Memory: {n_elements * dtype.itemsize / (1000**3)} GB")
    print("cpu_contiguous")
    cpu_contiguous(n_elements, shape)
    print("cpu_non_contiguous")
    cpu_non_contiguous(n_elements, shape)
    print("cpu_inplace_contiguous")
    cpu_inplace_contiguous(n_elements, shape)
    print("cpu_inplace_non_contiguous")
    cpu_inplace_non_contiguous(n_elements, shape)
    print("gpu_inplace_contiguous")
    gpu_inplace_contiguous(n_elements, shape)
    print("gpu_inplace_non_contiguous")
    gpu_inplace_non_contiguous(n_elements, shape)
    print("gpu_contiguous")
    gpu_contiguous(n_elements, shape)
    print("gpu_non_contiguous")
    gpu_non_contiguous(n_elements, shape)

Here is the script that I was using to test this. I'll try to get more conclusive results during the week. (And I need to fix the benchmarks; this PR is breaking them for some reason.)
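As a hedged illustration of the custom reduction functions mentioned above (a sketch under assumptions, not the PR's implementation; the helper name and the int64-only handling are made up for the example):

# Sketch of a user-defined MPI reduction usable with a derived datatype.
# Assumption: the datatype describes plain contiguous int64 data, so the
# raw bytes behind both buffers can be reinterpreted and summed directly.
from mpi4py import MPI
import numpy as np

def _sum_int64(inbuf, inoutbuf, datatype):
    a = np.frombuffer(inbuf, dtype=np.int64)
    b = np.frombuffer(inoutbuf, dtype=np.int64)
    b += a  # accumulate into the in/out buffer, as MPI reduction semantics require

# commute=True lets MPI reorder operands, which is valid for a sum
large_sum = MPI.Op.Create(_sum_int64, commute=True)

Such an op would be passed where MPI.SUM goes once the buffers are described by a derived datatype. The benchmark script above would be launched with something like mpirun -n 2 python benchmark.py (the file name is a placeholder).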
Due Diligence
Description
Some MPI implementations are limited to sending only 2^31 - 1 elements at once. As far as I have tested, this also applies to OpenMPI 4.1 and 5.0, because support has not been added to mpi4py (at least in my tests it failed).
This small change uses the trick described here to pack contiguous data into an MPI vector datatype, extending the limit on the number of elements that can be sent; a sketch of the idea follows below.
This is only needed for contiguous data, as non-contiguous data is already packed into recursive vector datatypes, reducing the need to apply this trick.
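A minimal sketch of the packing trick, under assumptions: the chunk size, the helper name, and the struct-type remainder handling follow the commonly described recipe, not necessarily this PR's exact code.

# Sketch of the large-count packing trick (illustrative, not the PR's
# code): split `count` elements into CHUNK-sized blocks described by a
# single vector datatype, plus a trailing remainder appended via a
# struct datatype, so one datatype instance covers the whole buffer.
from mpi4py import MPI

CHUNK = 2**30  # block length safely below the 2**31 - 1 count limit

def build_large_count_type(base: MPI.Datatype, count: int) -> MPI.Datatype:
    n_blocks, remainder = divmod(count, CHUNK)
    # n_blocks back-to-back blocks of CHUNK elements each
    vector = base.Create_vector(n_blocks, CHUNK, CHUNK)
    # place the remaining elements right behind the vector part
    offset = n_blocks * CHUNK * base.Get_size()
    combined = MPI.Datatype.Create_struct(
        [1, remainder], [0, offset], [vector, base]
    )
    combined.Commit()
    vector.Free()
    return combined

# usage sketch: comm.Send([buf, 1, build_large_count_type(dt, n)], dest=0)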
Issue/s resolved: #
Changes proposed:
Type of change
Does this change modify the behaviour of other functions? If so, which?
Yes, probably a lot of them.