
enable GPU-aware MPI when performance conditions are met #2967

Closed
BenWibking opened this issue Sep 29, 2022 · 21 comments · Fixed by #4318

Comments

@BenWibking
Contributor

BenWibking commented Sep 29, 2022

GPU-aware MPI significantly improves performance over the default host-pinned buffers in AMReX if two conditions are satisfied:

  1. CUDA_VISIBLE_DEVICES is not set (e.g., when using SLURM, launch with --gpu-bind=none).
  2. Managed memory is not used (set amrex.the_arena_is_managed=0).

If both of these are satisfied, then OpenMPI (at least) uses CUDA IPC (using the UCX cuda_ipc transport) to perform device-to-device copies over NVLink. In my tests, this is significantly faster than using AMReX's host-pinned buffers on A100 + NVLink systems, leading to on-node scaling that is essentially perfect (95-99% scaling efficiency).
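
For reference, a minimal sketch of a launch that satisfies both conditions on a SLURM system (the executable and inputs file names are placeholders):

```
# --gpu-bind=none keeps SLURM from setting CUDA_VISIBLE_DEVICES per task,
# and the_arena_is_managed=0 disables managed memory.
srun --gpu-bind=none ./my_amrex_app inputs \
    amrex.the_arena_is_managed=0 amrex.use_gpu_aware_mpi=1
```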

Scaling tests: quokka-astro/quokka#121
OpenMPI issue: open-mpi/ompi#10871

@WeiqunZhang
Member

Thanks for the information! Nice scaling results.

In AMReX, we know the number of visible GPU devices (queried via `AMREX_HIP_OR_CUDA(AMREX_HIP_SAFE_CALL(hipGetDeviceCount(&gpu_device_count));, ...)`). I think we should be able to change the default at runtime based on that.
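
A rough sketch of what that could look like (hypothetical logic, not the actual AMReX implementation):

```
int gpu_device_count = 0;
AMREX_HIP_OR_CUDA(AMREX_HIP_SAFE_CALL (hipGetDeviceCount (&gpu_device_count));,
                  AMREX_CUDA_SAFE_CALL(cudaGetDeviceCount(&gpu_device_count)););

// If each rank can see more than one GPU (i.e., CUDA_VISIBLE_DEVICES is not
// restricting visibility), same-node device-to-device IPC is possible, so the
// GPU-aware MPI default could be flipped on. The managed-memory condition
// would still need to be checked separately.
bool use_gpu_aware_mpi_default = (gpu_device_count > 1);
```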

I can do a draft PR. Would you be able to help us test? Thank you in advance.

@BenWibking
Contributor Author

I can definitely help test. Thanks!

@BenWibking changed the title from "GPU-aware MPI should be enabled if GPU binding is properly set and managed memory is disabled" to "enable GPU-aware MPI when performance conditions are met" on Sep 29, 2022
@maximumcats
Member

This is one where the multi-node behavior may be significantly different from the intra-node behavior. Different MPI implementations also differ in their CUDA-aware MPI support, and networking hardware varies. So you would want to test this on multiple setups and at multiple scales before changing the default. (There is also the wrinkle that MPI builds may not always have GPU awareness turned on, although this is less frequent lately.)

@BenWibking
Contributor Author

BenWibking commented Sep 29, 2022

I've queued 8-node and 64-node runs on NCSA Delta (which has Slingshot, but only 1 NIC per node). Unfortunately, this cluster only has OpenMPI (it's supposed to have the Cray environment, eventually). Other people will have to test other configurations.

Edit: On 8 nodes, CUDA-aware is still a win, but only by 3%. This might be because there is only 1 NIC per node but 4 GPUs per node on this system.

@WeiqunZhang
Member

I did some tests on Perlmutter. On a single node, the communication was about 10-20% faster with CUDA-aware MPI. But on 8 nodes, it was actually slightly slower.

@BenWibking
Contributor Author

That's interesting. I assume this was with Cray MPI?

I will be able to test on Infiniband + V100 + OpenMPI tomorrow.

@BenWibking
Contributor Author

BenWibking commented Sep 30, 2022

On V100/Infiniband/OpenMPI, GPU-aware on a single node is a significant improvement, and on 8 nodes it is a ~3% improvement. So on all the machines I currently have access to, GPU-aware always wins.

It would be good to know whether this is something only seen with OpenMPI, or if there's another explanation. It's also unknown whether this applies to AMD devices at all.

Edit: On an 8x MI100 node with OpenMPI, GPU-awareness improves performance by ~10% compared to host-pinned buffers. I don't have access to a multi-node AMD system to test the multi-node case. GPU-aware performance does not appear to be affected by the GPU binding.

@BenWibking
Contributor Author

> I did some tests on Perlmutter. On a single node, the communication was about 10-20% faster with CUDA-aware MPI. But on 8 nodes, it was actually slightly slower.

@WeiqunZhang Does lowering the value of MPICH_GPU_IPC_THRESHOLD change this result? It looks like the default is 8192: https://www.olcf.ornl.gov/wp-content/uploads/2021/04/HPE-Cray-MPI-Update-nfr-presented.pdf
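
For example, something like (the value here is arbitrary, just lower than the default, to test the effect):

```
export MPICH_GPU_IPC_THRESHOLD=1024
```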

@BenWibking
Contributor Author

On 64 nodes on NCSA Delta, I get a 13% performance improvement with CUDA-aware MPI over host pinned buffers on a hydro test problem. This is with OpenMPI+UCX for now. It will be interesting to see whether Cray MPI performance is significantly different.

@kngott
Contributor

kngott commented Oct 3, 2022

Which NCSA Delta partition was this? The 4 GPU A100 nodes, or a different partition?

@BenWibking
Contributor Author

> Which NCSA Delta partition was this? The 4 GPU A100 nodes, or a different partition?

This was the A100x4 partition.

@kngott
Contributor

kngott commented Oct 3, 2022

👍
Do you also happen to know (or can you find out) how many NICs it has per node and can you confirm that's Slingshot 10?

@BenWibking
Contributor Author

> 👍 Do you also happen to know (or can you find out) how many NICs it has per node and can you confirm that's Slingshot 10?

It has 1 NIC per node. I'll check whether it's Slingshot 10 or SS11, not sure about that offhand.

@BenWibking
Contributor Author

It's currently Slingshot 10.

@kngott
Contributor

kngott commented Oct 3, 2022

Makes sense, thanks!

So it sounds like the strongest possibilities are either affinity differences or that the OpenMPI+UCX implementation of CUDA-aware MPI is better. It would be really good to pin down the causes so AMReX can make informed decisions, and we could pass this along to the NERSC and/or Illinois teams to start some adjustments and discussions.

OpenMPI+UCX currently doesn't exist on Perlmutter. Is there an MPICH implementation on NCSA Delta?

One other general thing: we should probably make sure we're testing comms with amrex.use_profiler_syncs = 1 (#2762). That turns on a sync immediately before FillBoundary, ParallelCopy, and Redistribute so that the corresponding comm timers accurately measure communication performance rather than performance variations elsewhere that only get captured in the comm timers because of their internal sync points.
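
For example, an A/B comparison could look something like this (just a sketch; executable and inputs file names are placeholders):

```
# host-pinned buffers
srun ./my_amrex_app inputs amrex.use_profiler_syncs=1 amrex.use_gpu_aware_mpi=0
# GPU-aware MPI
srun ./my_amrex_app inputs amrex.use_profiler_syncs=1 amrex.use_gpu_aware_mpi=1
```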

@BenWibking
Contributor Author

> OpenMPI+UCX currently doesn't exist on Perlmutter. Is there an MPICH implementation on NCSA Delta?

No, not at the moment.

> One other general thing: we should probably make sure we're testing comms with amrex.use_profiler_syncs = 1 (#2762). That turns on a sync immediately before FillBoundary, ParallelCopy, and Redistribute so that the corresponding comm timers accurately measure communication performance rather than performance variations elsewhere that only get captured in the comm timers because of their internal sync points.

For my case, I've been comparing the total cell updates for a hydro test problem, rather than looking at the comm time itself.

@kngott
Contributor

kngott commented Oct 3, 2022

For use_profiler_syncs=1: makes sense to me. Just a note for us for future testing.

For MPICH: yeah, that tracks. Two systems, each with a different MPI implementation, getting different results. Couldn't be easy, could it? 😄

Thanks for all your work on this, Ben!

@BenWibking
Contributor Author

I've obtained access and run tests on Crusher. I can share those over email.

WeiqunZhang pushed a commit that referenced this issue Jun 13, 2023
## Summary
This change, suggested by @WeiqunZhang, points `the_fa_arena` to
`The_Device_Arena` when GPU-aware MPI is activated. This obviates the need
to set `the_arena_is_managed=0` to take advantage of GPU-aware MPI, since
GPU-aware MPI does not work well with managed memory.
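
A minimal sketch of the idea (the helper name below is illustrative, not the exact AMReX code):

```
#include <AMReX_Arena.H>

// Illustrative only: when GPU-aware MPI is active, FabArray communication
// buffers come from device memory so the MPI library can use IPC or
// GPUDirect RDMA on them; otherwise they come from pinned host memory.
amrex::Arena* fa_arena (bool use_gpu_aware_mpi)
{
    return use_gpu_aware_mpi ? amrex::The_Device_Arena()
                             : amrex::The_Pinned_Arena();
}
```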

## Additional background
This was a long-pending change, but the immediate trigger was finding that
GPU-aware MPI can reduce communication times significantly yet currently
requires setting `the_arena_is_managed=0`. Not setting this while using
GPU-aware MPI currently results in degraded performance.
Past discussion on GPU-aware MPI: #2967 

## Preliminary performance test
Running 100 steps on 8 GPUs over 2 Perlmutter A100 nodes with
`Tests/GPU/CNS/Exec/Sod`, `amr.n_cell = 128^3` per GPU,
`amr.max_grid_size = 128`, `amrex.use_profiler_syncs = 1` and setting
optimal GPU affinities.

### Without `amrex.use_gpu_aware_mpi=1`
```
FabArray::ParallelCopy_nowait()                200      0.133     0.1779     0.2067  17.82%
FabArray::ParallelCopy_finish()                200    0.07822     0.1193     0.1786  15.40%
```

### With `amrex.use_gpu_aware_mpi=1`
```
FabArray::ParallelCopy_nowait()                200    0.05655    0.07633     0.1034  11.20%
FabArray::ParallelCopy_finish()                200    0.03969    0.06087    0.09024   9.77%
```

Co-authored-by: Mukul Dave <[email protected]>
@BenWibking
Contributor Author

At least for current-generation GPU systems, it appears the root cause of the issue is cgroup isolation of GPUs on the same node preventing use of CUDA/ROCm IPC: open-mpi/ompi#11949.

@WeiqunZhang Since there is the communication arena now, does it make sense to enable GPU-aware MPI by default in AMReX?

@maximumcats
Member

This issue is more complex than whether or not IPC is available, since a large number of cases relevant to AMReX users will be on large (100s to 1000s of servers) configurations, so it's a balance of effects between IPC and RDMA. The default should be determined by looking at both small-scale (~10 servers) and large-scale (~100-1000 servers) benchmark runs on the major GPU systems we care about and determining which is better on average.

@WeiqunZhang
Member

Maybe we can add a build option that allows people to change the default at build time.

BenWibking added a commit to BenWibking/amrex that referenced this issue Jan 29, 2025
This turns on GPU-aware MPI by default.

On all current machines, simulations run faster with GPU-aware MPI enabled. Two technical issues that prevented this are now resolved: AMReX now has the communication arena, which does not use managed memory, and SLURM no longer uses cgroup isolation for GPU bindings by default.

Closes AMReX-Codes#2967.
WeiqunZhang added a commit that referenced this issue Feb 3, 2025
## Summary
This turns on GPU-aware MPI by default.

## Additional background
On all current machines, simulations run faster with GPU-aware MPI
enabled. Two technical issues that prevented this are now resolved:
AMReX now has the communication arena, which does not use managed
memory, and SLURM no longer uses cgroup isolation for GPU bindings by
default.

Closes #2967.

---------

Co-authored-by: Weiqun Zhang <[email protected]>