
Large data counts support for MPI Communication #1765

Open · wants to merge 13 commits into main

Conversation

@JuanPedroGHM (Member) commented Jan 22, 2025

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • benchmarks: created for new functionality
    • benchmarks: performance improved or maintained
    • documentation updated where needed

Description

Some MPI implementations are limited to sending only 2^31 - 1 elements at once. As far as I have tested, this also applies to OpenMPI 4.1 and 5.0, because large-count support has not been added to mpi4py (at least in my tests it failed).

This small change uses the trick described here to pack contiguous data into an MPI vector datatype, extending the limit on the number of elements that can be sent.

This only applies to contiguous data; non-contiguous data is already packed into recursive vector datatypes, so the trick is rarely needed there.
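
For illustration, here is a minimal mpi4py sketch of the vector-datatype trick for a contiguous buffer. This is not the code in this PR; the function name, the block length, and the float64 element type are assumptions made for the example.

from mpi4py import MPI
import numpy as np

MAX_COUNT = 2**31 - 1  # largest count many MPI implementations accept

def send_large(comm, buf, dest, tag=0):
    # Send a contiguous 1-D float64 numpy buffer whose size may exceed 2**31 - 1.
    count = buf.size
    basetype = MPI.DOUBLE  # assumes float64 elements
    if count <= MAX_COUNT:
        comm.Send([buf, count, basetype], dest=dest, tag=tag)
        return
    # Pack whole blocks into a vector datatype so the count passed to MPI stays small.
    blocklength = 2**20  # elements per block, an arbitrary choice for this sketch
    blocks, remainder = divmod(count, blocklength)
    vector = basetype.Create_vector(blocks, blocklength, blocklength).Commit()
    comm.Send([buf, 1, vector], dest=dest, tag=tag)  # one element of the vector type
    vector.Free()
    if remainder:
        # Send the tail that did not fill a whole block as a second message.
        comm.Send([buf[blocks * blocklength:], remainder, basetype], dest=dest, tag=tag)

Because the stride equals the block length, the vector datatype describes exactly the original contiguous buffer, so no data is rearranged; the receiving side would build the same datatype to match.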

Issue/s resolved: #

Changes proposed:

  • MPI Vector to send more than 2^31-1 elements at once.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Does this change modify the behaviour of other functions? If so, which?

Yes, probably a lot of them.

Contributor (github-actions bot)

Thank you for the PR!


codecov bot commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.

Project coverage is 92.25%. Comparing base (d66e404) to head (70f6432).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| heat/core/communication.py | 88.23% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1765      +/-   ##
==========================================
- Coverage   92.26%   92.25%   -0.01%     
==========================================
  Files          84       84              
  Lines       12447    12463      +16     
==========================================
+ Hits        11484    11498      +14     
- Misses        963      965       +2     
| Flag | Coverage Δ |
|---|---|
| unit | 92.25% <88.23%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@github-actions bot added the testing label (Implementation of tests, or test-related issues) on Jan 27, 2025

@mrfh92 (Collaborator) commented Jan 27, 2025

I have encountered the following problem:

import heat as ht 
import torch

shape = (2 ** 10, 2 ** 10, 2 ** 11)

data = torch.ones(shape, dtype=torch.float32) * ht.MPI_WORLD.rank
ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)

results in the following error:

  File "/heat/heat/core/communication.py", line 915, in Allreduce
    ret, sbuf, rbuf, buf = self.__reduce_like(self.handle.Allreduce, sendbuf, recvbuf, op)
  File "/heat/heat/core/communication.py", line 895, in __reduce_like
    return func(sendbuf, recvbuf, *args, **kwargs), sbuf, rbuf, buf
  File "src/mpi4py/MPI.src/Comm.pyx", line 1115, in mpi4py.MPI.Comm.Allreduce
mpi4py.MPI.Exception: MPI_ERR_OP: invalid reduce operation

With 2 ** 10 in the last entry of shape, there is no problem, so it seems to be related to large counts.

@JuanPedroGHM (Member, Author) commented Jan 27, 2025

Benchmark results - Sponsored by perun

| function | mpi_ranks | device | metric | value | ref_value | std | % change | type | alert | lower_quantile | upper_quantile |
|---|---|---|---|---|---|---|---|---|---|---|---|
| matmul_split_0 | 4 | CPU | RUNTIME | 0.149021 | 0.133988 | 0.0111788 | 11.2196 | jump-detection | True | nan | nan |
| reshape | 4 | CPU | RUNTIME | 0.150924 | 0.205227 | 0.0297334 | -26.4599 | jump-detection | True | nan | nan |
| apply_inplace_max_abs_scaler_and_inverse | 4 | CPU | RUNTIME | 0.000655345 | 0.000508517 | 8.00391e-05 | 28.8737 | jump-detection | True | nan | nan |
| apply_inplace_normalizer | 4 | CPU | RUNTIME | 0.00407308 | 0.0010457 | 0.00748763 | 289.506 | jump-detection | True | nan | nan |
| qr_split_0 | 4 | CPU | RUNTIME | 0.244178 | 0.23604 | 0.00674248 | 3.44758 | trend-deviation | True | 0.231439 | 0.240179 |
| qr_split_1 | 4 | CPU | RUNTIME | 0.154358 | 0.174964 | 0.00222697 | -11.7775 | trend-deviation | True | 0.169368 | 0.187895 |
| hierachical_svd_rank | 4 | CPU | RUNTIME | 0.0493446 | 0.0477514 | 0.00171717 | 3.3365 | trend-deviation | True | 0.0466213 | 0.0486949 |
| hierachical_svd_tol | 4 | CPU | RUNTIME | 0.0534304 | 0.0521002 | 0.0014912 | 2.55317 | trend-deviation | True | 0.0515334 | 0.0530422 |
| kmeans | 4 | CPU | RUNTIME | 0.322664 | 0.311884 | 0.0164465 | 3.45656 | trend-deviation | True | 0.306719 | 0.318458 |
| kmedians | 4 | CPU | RUNTIME | 0.441626 | 0.425078 | 0.0263389 | 3.89277 | trend-deviation | True | 0.416978 | 0.435575 |
| reshape | 4 | CPU | RUNTIME | 0.150924 | 0.200174 | 0.0297334 | -24.6034 | trend-deviation | True | 0.160781 | 0.213606 |
| incremental_pca_split0 | 4 | CPU | RUNTIME | 34.4686 | 37.3459 | 0.0511448 | -7.70459 | trend-deviation | True | 37.1031 | 37.6187 |

Grafana Dashboard
Last updated: 2025-02-05T11:19:48Z

@mrfh92 (Collaborator) commented Jan 27, 2025

Could the problem be that, for all communication involving MPI operations like MPI.SUM, such an operation is not well-defined on the MPI vector construction chosen for the buffers?


@JuanPedroGHM (Member, Author)

> Could the problem be that, for all communication involving MPI operations like MPI.SUM, such an operation is not well-defined on the MPI vector construction chosen for the buffers?

Have you found a bug? I don't think it should be an issue, as the vector datatype just points to where the data is, where it needs to go, and in what order. As long as both send and recv buffers are well-defined by the datatype, there should not be an issue with MPI operations.

@mrfh92 (Collaborator) commented Jan 28, 2025

The example with Allreduce I posted above caused an error for me.


@JuanPedroGHM (Member, Author)

Had a look and found out that reduction operations with derived datatypes require custom-made reduction functions to work. The latest commits handle that. It came with a slightly bigger refactoring of __reduce_like than expected, but it might have a lot of benefits.

After some early and inconclusive tests, it looks like it might bring both a runtime and a memory usage improvement.
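
For context, built-in operators such as MPI.SUM are only defined for predefined datatypes, which is why a reduction over a derived (vector) datatype needs a user-defined operator created with MPI.Op.Create. Below is a minimal mpi4py sketch, assuming the derived datatype describes a packed run of float64 values; the names are illustrative and not taken from this PR.

from mpi4py import MPI
import numpy as np

def _vector_sum(inbuf, outbuf, datatype):
    # Interpret both raw buffers as float64 and accumulate into outbuf in place.
    # Assumes the derived datatype covers a contiguous run of float64 elements.
    a = np.frombuffer(inbuf, dtype=np.float64)
    b = np.frombuffer(outbuf, dtype=np.float64)
    b += a

custom_sum = MPI.Op.Create(_vector_sum, commute=True)

# Example use with a buffer described by a committed vector datatype vec_type:
# comm.Allreduce(MPI.IN_PLACE, [recv_array, 1, vec_type], op=custom_sum)

custom_sum.Free()

A fuller implementation would dispatch on the actual element type behind the derived datatype instead of hard-coding float64. The author's full test script follows.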

import heat as ht 
import torch
import perun

from heat.core.communication import CUDA_AWARE_MPI
print(f"CUDA_AWARE_MPI: {CUDA_AWARE_MPI}")

world_size = ht.MPI_WORLD.size
rank = ht.MPI_WORLD.rank
print(f"{rank}/{world_size} reporting for duty!")
dtype = torch.int64

@perun.monitor()
def cpu_inplace_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements,dtype=dtype).reshape(shape)
    ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)
    del data

@perun.monitor()
def cpu_inplace_non_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements,dtype=dtype).reshape(shape)
    data = data.permute(2,1,0)
    ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)
    del data

@perun.monitor()
def cpu_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements,dtype=dtype).reshape(shape)
    zeros = torch.zeros(shape, dtype=dtype)
    ht.MPI_WORLD.Allreduce(zeros, data, ht.MPI.SUM)
    del data
    del zeros

@perun.monitor()
def cpu_non_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements,dtype=dtype).reshape(shape)
    data = data.permute(2,1,0)
    zeros = torch.zeros(shape, dtype=dtype)
    zeros = zeros.permute(2,1,0)
    ht.MPI_WORLD.Allreduce(data, zeros, ht.MPI.SUM)
    del data
    del zeros

@perun.monitor()
def gpu_inplace_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype, device=f"cuda:{rank}").reshape(shape)
    print(f"Rank {rank} working on device {data.device}")
    ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)
    del data

@perun.monitor()
def gpu_inplace_non_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype,device=f"cuda:{rank}").reshape(shape)
    data = data.permute(2,1,0)
    ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)
    del data

@perun.monitor()
def gpu_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype,device=f"cuda:{rank}").reshape(shape)
    zeros = torch.zeros(shape, dtype=dtype,device="cuda")
    ht.MPI_WORLD.Allreduce(data, zeros, ht.MPI.SUM)
    del data
    del zeros

@perun.monitor()
def gpu_non_contiguous(n_elements, shape):
    data: torch.Tensor = torch.arange(n_elements, dtype=dtype,device=f"cuda:{rank}").reshape(shape)
    data = data.permute(2,1,0)
    zeros = torch.zeros(shape, dtype=dtype,device="cuda")
    zeros = zeros.permute(2,1,0)
    ht.MPI_WORLD.Allreduce(data, zeros, ht.MPI.SUM)
    del data
    del zeros

    

if __name__ == "__main__":
    base_shape = (500, 500, 500 * world_size)
    shape = base_shape
    n_elements = torch.prod(torch.tensor(shape))
    print(f"N Elements: {n_elements}, Memory: {n_elements * dtype.itemsize / (1000**3)}Gb")

    print("cpu_contigous")
    cpu_contiguous(n_elements, shape)
    print("cpu_non_contigous")
    cpu_non_contiguous(n_elements,shape)
    print("cpu_inplace_contiguous")
    cpu_inplace_contiguous(n_elements, shape)
    print("cpu_inplace_non_contiguous")
    cpu_inplace_non_contiguous(n_elements, shape)

    print("gpu_inplace_contiguous")
    gpu_inplace_contiguous(n_elements, shape)
    print("gpu_inplace_non_contiguous")
    gpu_inplace_non_contiguous(n_elements, shape)
    print("gpu_contigous")
    gpu_contiguous(n_elements, shape)
    print("gpu_non_contigous")
    gpu_non_contiguous(n_elements,shape)

Here is the script that I was using to test this. I'll try to get more conclusive results during the week (and to fix the benchmarks; this is breaking them for some reason).

@JuanPedroGHM requested a review from mrfh92 on February 4, 2025 at 16:43

@JuanPedroGHM (Member, Author)

[3 images attached]

Labels: backport release, backport stable, benchmark PR, benchmarking, bug, core, PR talk, testing