
Large data counts support for MPI Communication #1765

Open · wants to merge 13 commits into main

Conversation

@JuanPedroGHM (Member) commented Jan 22, 2025

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • benchmarks: created for new functionality
    • benchmarks: performance improved or maintained
    • documentation updated where needed

Description

Some MPI implementations are limited to sending only 2^31 - 1 elements at once. As far as I have tested, this also applies to OpenMPI 4.1 and 5.0, because large-count support has not been added to mpi4py (at least in my tests it failed).

This small change uses the trick described here to pack contiguous data into an MPI vector datatype, extending the limit on the number of elements that can be sent.

This only applies to contiguous data; non-contiguous data is already packed into recursive vector datatypes, so the trick is rarely needed there.
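
For illustration, here is a minimal mpi4py sketch of the vector-datatype trick for a contiguous buffer. This is not the code in this PR; the function name, the block length, and the float64 element type are assumptions made for the example.

from mpi4py import MPI
import numpy as np

MAX_COUNT = 2**31 - 1  # largest count many MPI implementations accept

def send_large(comm, buf, dest, tag=0):
    # Send a contiguous 1-D float64 numpy buffer whose size may exceed 2**31 - 1.
    count = buf.size
    basetype = MPI.DOUBLE  # assumes float64 elements
    if count <= MAX_COUNT:
        comm.Send([buf, count, basetype], dest=dest, tag=tag)
        return
    # Pack whole blocks into a vector datatype so the count passed to MPI stays small.
    blocklength = 2**20  # elements per block, an arbitrary choice for this sketch
    blocks, remainder = divmod(count, blocklength)
    vector = basetype.Create_vector(blocks, blocklength, blocklength).Commit()
    comm.Send([buf, 1, vector], dest=dest, tag=tag)  # one element of the vector type
    vector.Free()
    if remainder:
        # Send the tail that did not fill a whole block as a second message.
        comm.Send([buf[blocks * blocklength:], remainder, basetype], dest=dest, tag=tag)

Because the stride equals the block length, the vector datatype describes exactly the original contiguous buffer, so no data is rearranged; the receiving side would build the same datatype to match.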

Issue/s resolved: #

Changes proposed:

  • MPI Vector to send more than 2^31-1 elements at once.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Does this change modify the behaviour of other functions? If so, which?

Yes, probably a lot of them.

Contributor (github-actions bot)

Thank you for the PR!


codecov bot commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.

Project coverage is 92.25%. Comparing base (d66e404) to head (70f6432).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| heat/core/communication.py | 88.23% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1765      +/-   ##
==========================================
- Coverage   92.26%   92.25%   -0.01%     
==========================================
  Files          84       84              
  Lines       12447    12463      +16     
==========================================
+ Hits        11484    11498      +14     
- Misses        963      965       +2     
| Flag | Coverage Δ |
|---|---|
| unit | 92.25% <88.23%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@github-actions bot added the testing label (Implementation of tests, or test-related issues) on Jan 27, 2025

@mrfh92 (Collaborator) commented Jan 27, 2025

I have encountered the following problem:

import heat as ht 
import torch

shape = (2 ** 10, 2 ** 10, 2 ** 11)

data = torch.ones(shape, dtype=torch.float32) * ht.MPI_WORLD.rank
ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)

results in the following error:

  File "/heat/heat/core/communication.py", line 915, in Allreduce
    ret, sbuf, rbuf, buf = self.__reduce_like(self.handle.Allreduce, sendbuf, recvbuf, op)
  File "/heat/heat/core/communication.py", line 895, in __reduce_like
    return func(sendbuf, recvbuf, *args, **kwargs), sbuf, rbuf, buf
  File "src/mpi4py/MPI.src/Comm.pyx", line 1115, in mpi4py.MPI.Comm.Allreduce
mpi4py.MPI.Exception: MPI_ERR_OP: invalid reduce operation

With 2 ** 10 in the last entry of shape, there is no problem, so it seems to be related to large counts.

@JuanPedroGHM (Member, Author) commented Jan 27, 2025

Benchmark results - Sponsored by perun

| function | mpi_ranks | device | metric | value | ref_value | std | % change | type | alert | lower_quantile | upper_quantile |
|---|---|---|---|---|---|---|---|---|---|---|---|
| matmul_split_0 | 4 | CPU | RUNTIME | 0.149021 | 0.133988 | 0.0111788 | 11.2196 | jump-detection | True | nan | nan |
| reshape | 4 | CPU | RUNTIME | 0.150924 | 0.205227 | 0.0297334 | -26.4599 | jump-detection | True | nan | nan |
| apply_inplace_max_abs_scaler_and_inverse | 4 | CPU | RUNTIME | 0.000655345 | 0.000508517 | 8.00391e-05 | 28.8737 | jump-detection | True | nan | nan |
| apply_inplace_normalizer | 4 | CPU | RUNTIME | 0.00407308 | 0.0010457 | 0.00748763 | 289.506 | jump-detection | True | nan | nan |
| qr_split_0 | 4 | CPU | RUNTIME | 0.244178 | 0.23604 | 0.00674248 | 3.44758 | trend-deviation | True | 0.231439 | 0.240179 |
| qr_split_1 | 4 | CPU | RUNTIME | 0.154358 | 0.174964 | 0.00222697 | -11.7775 | trend-deviation | True | 0.169368 | 0.187895 |
| hierachical_svd_rank | 4 | CPU | RUNTIME | 0.0493446 | 0.0477514 | 0.00171717 | 3.3365 | trend-deviation | True | 0.0466213 | 0.0486949 |
| hierachical_svd_tol | 4 | CPU | RUNTIME | 0.0534304 | 0.0521002 | 0.0014912 | 2.55317 | trend-deviation | True | 0.0515334 | 0.0530422 |
| kmeans | 4 | CPU | RUNTIME | 0.322664 | 0.311884 | 0.0164465 | 3.45656 | trend-deviation | True | 0.306719 | 0.318458 |
| kmedians | 4 | CPU | RUNTIME | 0.441626 | 0.425078 | 0.0263389 | 3.89277 | trend-deviation | True | 0.416978 | 0.435575 |
| reshape | 4 | CPU | RUNTIME | 0.150924 | 0.200174 | 0.0297334 | -24.6034 | trend-deviation | True | 0.160781 | 0.213606 |
| incremental_pca_split0 | 4 | CPU | RUNTIME | 34.4686 | 37.3459 | 0.0511448 | -7.70459 | trend-deviation | True | 37.1031 | 37.6187 |

Grafana Dashboard
Last updated: 2025-02-05T11:19:48Z

@mrfh92 (Collaborator) commented Jan 27, 2025

Could the problem be that, for all communication involving MPI operations like MPI.SUM, such an operation is not well-defined on the MPI vector construction chosen for the buffers?


@JuanPedroGHM (Member, Author)

> Could the problem be that, for all communication involving MPI operations like MPI.SUM, such an operation is not well-defined on the MPI vector construction chosen for the buffers?

Have you found a bug? I don't think it should be an issue, as the vector datatype just points to where the data is, where it needs to go, and in what order. As long as both send and recv buffers are well-defined by the datatype, there should not be an issue with MPI operations.

@mrfh92 (Collaborator) commented Jan 28, 2025

The example with Allreduce I posted above caused an error for me.


@JuanPedroGHM (Member, Author)

Had a look and found out that reduction operations with derived datatypes require custom-made reduction functions to work. The latest commits handle that. It came with a slightly bigger refactoring of __reduce_like than expected, but it might have a lot of benefits.

After some early and inconclusive tests, it looks like it might bring both a runtime and a memory usage improvement.
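
For context, built-in operators such as MPI.SUM are only defined for predefined datatypes, which is why a reduction over a derived (vector) datatype needs a user-defined operator created with MPI.Op.Create. Below is a minimal mpi4py sketch, assuming the derived datatype describes a packed run of float64 values; the names are illustrative and not taken from this PR.

from mpi4py import MPI
import numpy as np

def _vector_sum(inbuf, outbuf, datatype):
    # Interpret both raw buffers as float64 and accumulate into outbuf in place.
    # Assumes the derived datatype covers a contiguous run of float64 elements.
    a = np.frombuffer(inbuf, dtype=np.float64)
    b = np.frombuffer(outbuf, dtype=np.float64)
    b += a

custom_sum = MPI.Op.Create(_vector_sum, commute=True)

# Example use with a buffer described by a committed vector datatype vec_type:
# comm.Allreduce(MPI.IN_PLACE, [recv_array, 1, vec_type], op=custom_sum)

custom_sum.Free()

A fuller implementation would dispatch on the actual element type behind the derived datatype instead of hard-coding float64. The author's full test script follows.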

import heat as ht 
import torch
import perun

from heat.core.communication import CUDA_AWARE_MPI
print(f"CUDA_AWARE_MPI: {CUDA_AWARE_MPI}")

world_size = ht.MPI_WORLD.size
rank = ht.MPI_WORLD.rank
print(f"{rank}/{world_size} reporting for duty!")
dtype = torch.int64

@perun.monitor()
def cpu_inplace_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements,dtype=dtype).reshape(shape)
    ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)
    del data

@perun.monitor()
def cpu_inplace_non_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements,dtype=dtype).reshape(shape)
    data = data.permute(2,1,0)
    ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)
    del data

@perun.monitor()
def cpu_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements,dtype=dtype).reshape(shape)
    zeros = torch.zeros(shape, dtype=dtype)
    ht.MPI_WORLD.Allreduce(zeros, data, ht.MPI.SUM)
    del data
    del zeros

@perun.monitor()
def cpu_non_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements,dtype=dtype).reshape(shape)
    data = data.permute(2,1,0)
    zeros = torch.zeros(shape, dtype=dtype)
    zeros = zeros.permute(2,1,0)
    ht.MPI_WORLD.Allreduce(data, zeros, ht.MPI.SUM)
    del data
    del zeros

@perun.monitor()
def gpu_inplace_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype, device=f"cuda:{rank}").reshape(shape)
    print(f"Rank {rank} working on device {data.device}")
    ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)
    del data

@perun.monitor()
def gpu_inplace_non_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype,device=f"cuda:{rank}").reshape(shape)
    data = data.permute(2,1,0)
    ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)
    del data

@perun.monitor()
def gpu_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype,device=f"cuda:{rank}").reshape(shape)
    zeros = torch.zeros(shape, dtype=dtype,device="cuda")
    ht.MPI_WORLD.Allreduce(data, zeros, ht.MPI.SUM)
    del data
    del zeros

@perun.monitor()
def gpu_non_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype,device=f"cuda:{rank}").reshape(shape)
    data = data.permute(2,1,0)
    zeros = torch.zeros(shape, dtype=dtype,device="cuda")
    zeros = zeros.permute(2,1,0)
    ht.MPI_WORLD.Allreduce(data, zeros, ht.MPI.SUM)
    del data
    del zeros

    

if __name__ == "__main__":
    base_shape = (500, 500, 500 * world_size)
    shape = base_shape
    n_elements = torch.prod(torch.tensor(shape))
    print(f"N Elements: {n_elements}, Memory: {n_elements * dtype.itemsize / (1000**3)}Gb")

    print("cpu_contigous")
    cpu_contiguous(n_elements, shape)
    print("cpu_non_contigous")
    cpu_non_contiguous(n_elements,shape)
    print("cpu_inplace_contiguous")
    cpu_inplace_contiguous(n_elements, shape)
    print("cpu_inplace_non_contiguous")
    cpu_inplace_non_contiguous(n_elements, shape)

    print("gpu_inplace_contiguous")
    gpu_inplace_contiguous(n_elements, shape)
    print("gpu_inplace_non_contiguous")
    gpu_inplace_non_contiguous(n_elements, shape)
    print("gpu_contigous")
    gpu_contiguous(n_elements, shape)
    print("gpu_non_contigous")
    gpu_non_contiguous(n_elements,shape)

Here is the script that I was using to test this. I'll try to get more conclusive results during the week (and to fix the benchmarks; this is breaking them for some reason).

@JuanPedroGHM requested a review from mrfh92 on February 4, 2025 at 16:43

@JuanPedroGHM (Member, Author)

[3 images attached]

Labels: backport release, backport stable, benchmark PR, benchmarking, bug, core, PR talk, testing