
enable GPU-aware MPI when performance conditions are met #2967

Closed
BenWibking opened this issue Sep 29, 2022 · 21 comments · Fixed by #4318

Comments

@BenWibking
Contributor

BenWibking commented Sep 29, 2022

GPU-aware MPI significantly improves performance over the default host-pinned buffers in AMReX if two conditions are satisfied:

  1. CUDA_VISIBLE_DEVICES is not set (e.g., when using SLURM, launch with --gpu-bind=none).
  2. Managed memory is not used (set amrex.the_arena_is_managed=0).

If both of these are satisfied, then OpenMPI (at least) uses CUDA IPC (using the UCX cuda_ipc transport) to perform device-to-device copies over NVLink. In my tests, this is significantly faster than using AMReX's host-pinned buffers on A100 + NVLink systems, leading to on-node scaling that is essentially perfect (95-99% scaling efficiency).
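
For reference, a minimal sketch of a launch that satisfies both conditions on a SLURM system (the executable and inputs file names are placeholders):

```
# --gpu-bind=none keeps SLURM from setting CUDA_VISIBLE_DEVICES per task,
# and the_arena_is_managed=0 disables managed memory.
srun --gpu-bind=none ./my_amrex_app inputs \
    amrex.the_arena_is_managed=0 amrex.use_gpu_aware_mpi=1
```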

Scaling tests: quokka-astro/quokka#121
OpenMPI issue: open-mpi/ompi#10871

@WeiqunZhang
Member

Thanks for the information! Nice scaling results.

In AMReX, we know the number of visible GPU devices (queried via `AMREX_HIP_OR_CUDA(AMREX_HIP_SAFE_CALL(hipGetDeviceCount(&gpu_device_count));, ...)`). I think we should be able to change the default at runtime based on that.
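
A rough sketch of what that could look like (hypothetical logic, not the actual AMReX implementation):

```
int gpu_device_count = 0;
AMREX_HIP_OR_CUDA(AMREX_HIP_SAFE_CALL (hipGetDeviceCount (&gpu_device_count));,
                  AMREX_CUDA_SAFE_CALL(cudaGetDeviceCount(&gpu_device_count)););

// If each rank can see more than one GPU (i.e., CUDA_VISIBLE_DEVICES is not
// restricting visibility), same-node device-to-device IPC is possible, so the
// GPU-aware MPI default could be flipped on. The managed-memory condition
// would still need to be checked separately.
bool use_gpu_aware_mpi_default = (gpu_device_count > 1);
```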

I can do a draft PR. Would you be able to help us test? Thank you in advance.

@BenWibking
Contributor Author

I can definitely help test. Thanks!

@BenWibking changed the title from "GPU-aware MPI should be enabled if GPU binding is properly set and managed memory is disabled" to "enable GPU-aware MPI when performance conditions are met" on Sep 29, 2022
@maximumcats
Member

This is one where the multi-node behavior may be significantly different from the intra-node behavior. Different MPI implementations also differ in their CUDA-aware MPI support, and networking hardware varies. So you would want to test this on multiple setups and at multiple scales before changing the default. (There is also the wrinkle that MPI builds may not always have GPU awareness turned on, although this is less frequent lately.)

@BenWibking
Contributor Author

BenWibking commented Sep 29, 2022

I've queued 8-node and 64-node runs on NCSA Delta (which has Slingshot, but only 1 NIC per node). Unfortunately, this cluster only has OpenMPI (it's supposed to have the Cray environment, eventually). Other people will have to test other configurations.

Edit: On 8 nodes, CUDA-aware is still a win, but only by 3%. This might be because there is only 1 NIC per node but 4 GPUs per node on this system.

@WeiqunZhang
Member

I did some tests on Perlmutter. On a single node, the communication was about 10-20% faster with CUDA-aware MPI. But on 8 nodes, it was actually slightly slower.

@BenWibking
Contributor Author

That's interesting. I assume this was with Cray MPI?

I will be able to test on Infiniband + V100 + OpenMPI tomorrow.

@BenWibking
Contributor Author

BenWibking commented Sep 30, 2022

On V100/Infiniband/OpenMPI, GPU-aware on a single node is a significant improvement, and on 8 nodes it is a ~3% improvement. So on all the machines I currently have access to, GPU-aware always wins.

It would be good to know whether this is something only seen with OpenMPI, or if there's another explanation. It's also unknown whether this applies to AMD devices at all.

Edit: On an 8x MI100 node with OpenMPI, GPU-awareness improves performance by ~10% compared to host-pinned buffers. I don't have access to a multi-node AMD system to test the multi-node case. GPU-aware performance does not appear to be affected by the GPU binding.

@BenWibking
Contributor Author

> I did some tests on Perlmutter. On a single node, the communication was about 10-20% faster with CUDA-aware MPI. But on 8 nodes, it was actually slightly slower.

@WeiqunZhang Does lowering the value of MPICH_GPU_IPC_THRESHOLD change this result? It looks like the default is 8192: https://www.olcf.ornl.gov/wp-content/uploads/2021/04/HPE-Cray-MPI-Update-nfr-presented.pdf
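
For example, something like (the value here is arbitrary, just lower than the default, to test the effect):

```
export MPICH_GPU_IPC_THRESHOLD=1024
```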

@BenWibking
Contributor Author

On 64 nodes on NCSA Delta, I get a 13% performance improvement with CUDA-aware MPI over host pinned buffers on a hydro test problem. This is with OpenMPI+UCX for now. It will be interesting to see whether Cray MPI performance is significantly different.

@kngott
Contributor

kngott commented Oct 3, 2022

Which NCSA Delta partition was this? The 4 GPU A100 nodes, or a different partition?

@BenWibking
Contributor Author

> Which NCSA Delta partition was this? The 4 GPU A100 nodes, or a different partition?

This was the A100x4 partition.

@kngott
Contributor

kngott commented Oct 3, 2022

👍
Do you also happen to know (or can you find out) how many NICs it has per node and can you confirm that's Slingshot 10?

@BenWibking
Contributor Author

> 👍 Do you also happen to know (or can you find out) how many NICs it has per node and can you confirm that's Slingshot 10?

It has 1 NIC per node. I'll check whether it's Slingshot 10 or SS11, not sure about that offhand.

@BenWibking
Contributor Author

It's currently Slingshot 10.

@kngott
Contributor

kngott commented Oct 3, 2022

Makes sense, thanks!

So it sounds like the strongest possibilities are either affinity differences or that the OpenMPI+UCX implementation of CUDA-aware MPI is better. It would be really good to pin down the causes so AMReX can make informed decisions, and we could pass this along to the NERSC and/or Illinois teams to start some adjustments and discussions.

OpenMPI+UCX currently doesn't exist on Perlmutter. Is there an MPICH implementation on NCSA Delta?

One other general thing: we should probably make sure we're testing comms with amrex.use_profiler_syncs = 1 (#2762). That turns on a sync immediately before FillBoundary, ParallelCopy, and Redistribute so that the corresponding comm timers accurately measure communication performance rather than performance variations elsewhere that only get captured in the comm timers because of their internal sync points.
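
For example, an A/B comparison could look something like this (just a sketch; executable and inputs file names are placeholders):

```
# host-pinned buffers
srun ./my_amrex_app inputs amrex.use_profiler_syncs=1 amrex.use_gpu_aware_mpi=0
# GPU-aware MPI
srun ./my_amrex_app inputs amrex.use_profiler_syncs=1 amrex.use_gpu_aware_mpi=1
```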

@BenWibking
Contributor Author

> OpenMPI+UCX currently doesn't exist on Perlmutter. Is there an MPICH implementation on NCSA Delta?

No, not at the moment.

> One other general thing: we should probably make sure we're testing comms with amrex.use_profiler_syncs = 1 (#2762). That turns on a sync immediately before FillBoundary, ParallelCopy, and Redistribute so that the corresponding comm timers accurately measure communication performance rather than performance variations elsewhere that only get captured in the comm timers because of their internal sync points.

For my case, I've been comparing the total cell updates for a hydro test problem, rather than looking at the comm time itself.

@kngott
Contributor

kngott commented Oct 3, 2022

For use_profiler_syncs=1: makes sense to me. Just a note for us for future testing.

For MPICH: yeah, that tracks. Two systems, each with a different MPI implementation, getting different results. Couldn't be easy, could it? 😄

Thanks for all your work on this, Ben!

@BenWibking
Contributor Author

I've obtained access and run tests on Crusher. I can share those over email.

WeiqunZhang pushed a commit that referenced this issue Jun 13, 2023
## Summary
This change, suggested by @WeiqunZhang, points `the_fa_arena` to
`The_Device_Arena` when GPU-aware MPI is activated. This obviates the need
to set `the_arena_is_managed=0` to take advantage of GPU-aware MPI, since
GPU-aware MPI does not work well with managed memory.
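
A minimal sketch of the idea (the helper name below is illustrative, not the exact AMReX code):

```
#include <AMReX_Arena.H>

// Illustrative only: when GPU-aware MPI is active, FabArray communication
// buffers come from device memory so the MPI library can use IPC or
// GPUDirect RDMA on them; otherwise they come from pinned host memory.
amrex::Arena* fa_arena (bool use_gpu_aware_mpi)
{
    return use_gpu_aware_mpi ? amrex::The_Device_Arena()
                             : amrex::The_Pinned_Arena();
}
```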

## Additional background
This was a long-pending change, but the immediate trigger was finding that
GPU-aware MPI can reduce communication times significantly yet currently
requires setting `the_arena_is_managed=0`. Not setting this while using
GPU-aware MPI currently results in degraded performance.
Past discussion on GPU-aware MPI: #2967 

## Preliminary performance test
Running 100 steps on 8 GPUs over 2 Perlmutter A100 nodes with
`Tests/GPU/CNS/Exec/Sod`, `amr.n_cell = 128^3` per GPU,
`amr.max_grid_size = 128`, `amrex.use_profiler_syncs = 1` and setting
optimal GPU affinities.

### Without `amrex.use_gpu_aware_mpi=1`
```
FabArray::ParallelCopy_nowait()                200      0.133     0.1779     0.2067  17.82%
FabArray::ParallelCopy_finish()                200    0.07822     0.1193     0.1786  15.40%
```

### With `amrex.use_gpu_aware_mpi=1`
```
FabArray::ParallelCopy_nowait()                200    0.05655    0.07633     0.1034  11.20%
FabArray::ParallelCopy_finish()                200    0.03969    0.06087    0.09024   9.77%
```

Co-authored-by: Mukul Dave <[email protected]>
@BenWibking
Contributor Author

At least for current-generation GPU systems, it appears the root cause of the issue is cgroup isolation of GPUs on the same node preventing use of CUDA/ROCm IPC: open-mpi/ompi#11949.

@WeiqunZhang Since there is the communication arena now, does it make sense to enable GPU-aware MPI by default in AMReX?

@maximumcats
Member

This issue is more complex than whether or not IPC is available, since a large number of cases relevant to AMReX users will be on large (100s to 1000s of servers) configurations, so it's a balance of effects between IPC and RDMA. The default should be determined by looking at both small-scale (~10 servers) and large-scale (~100-1000 servers) benchmark runs on the major GPU systems we care about and determining which is better on average.

@WeiqunZhang
Member

Maybe we can add a build option that allows people to change the default at build time.

BenWibking added a commit to BenWibking/amrex that referenced this issue Jan 29, 2025
This turns on GPU-aware MPI by default.

On all current machines, simulations run faster with GPU-aware MPI enabled. Two technical issues that prevented this are now resolved: AMReX now has the communication arena, which does not use managed memory, and SLURM no longer uses cgroup isolation for GPU bindings by default.

Closes AMReX-Codes#2967.
WeiqunZhang added a commit that referenced this issue Feb 3, 2025
## Summary
This turns on GPU-aware MPI by default.

## Additional background
On all current machines, simulations run faster with GPU-aware MPI
enabled. Two technical issues that prevented this are now resolved:
AMReX now has the communication arena, which does not use managed
memory, and SLURM no longer uses cgroup isolation for GPU bindings by
default.

Closes #2967.

---------

Co-authored-by: Weiqun Zhang <[email protected]>