# enable GPU-aware MPI when performance conditions are met #2967
Thanks for the information! Nice scaling results. In AMReX, we know the number of visible GPU devices (see `amrex/Src/Base/AMReX_GpuDevice.cpp`, line 163 at commit e55d6b4).
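For reference, a minimal sketch of how the visible-device count can be queried with the CUDA runtime; this is an illustration, not the actual code at that line, and the function name is made up here:

```cpp
// Sketch only: count the GPUs visible to this process. CUDA_VISIBLE_DEVICES
// (and cgroup-based GPU binding) reduces this count, which is why it matters
// when deciding whether GPU-aware MPI can use same-node IPC.
#include <cuda_runtime.h>
#include <cstdio>

int visible_device_count ()
{
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 0;
    }
    return n;
}
```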
I can do a draft PR. Would you be able to help us test? Thank you in advance.
I can definitely help test. Thanks!
This is one where the multi-node behavior may be significantly different from the intra-node behavior. It is also the case that MPI implementations differ with respect to CUDA-aware MPI, and networking hardware varies. So you would want to test this on multiple setups and at multiple scales before changing the default. (There is also the wrinkle that MPI builds may not always have GPU awareness turned on, although this is less frequent lately.)
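On that last point, Open MPI exposes an extension for probing CUDA-awareness at compile time and runtime. A hedged sketch: the `MPIX_Query_cuda_support()` check is Open MPI-specific, and other MPI implementations need their own mechanism.

```cpp
// Sketch: detect whether the MPI library was built with CUDA support.
// MPIX_Query_cuda_support() is an Open MPI extension (mpi-ext.h), so this is
// a best-effort probe rather than a portable guarantee.
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>
#endif

bool mpi_is_cuda_aware ()
{
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    return MPIX_Query_cuda_support() == 1;   // runtime confirmation
#else
    return false;   // unknown or disabled at compile time; assume not
#endif
}
```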
I've queued 8-node and 64-node runs on NCSA Delta (which has Slingshot, but only 1 NIC per node). Unfortunately, this cluster only has OpenMPI (it's supposed to have the Cray environment, eventually). Other people will have to test other configurations. Edit: On 8 nodes, CUDA-aware is still a win, but only by 3%. This might be because there is only 1 NIC per node but 4 GPUs per node on this system.
I did some tests on Perlmutter. On a single node, the communication was about 10-20% faster with CUDA-aware MPI. But on 8 nodes, it was actually slightly slower.
That's interesting. I assume this was with Cray MPI? I will be able to test on InfiniBand + V100 + OpenMPI tomorrow.
On V100/InfiniBand/OpenMPI, GPU-aware on a single node is a significant improvement, and on 8 nodes it is a ~3% improvement. So on all the machines I currently have access to, GPU-aware always wins. It would be good to know if this is something only seen on OpenMPI, or if there's another explanation. Edit: On an 8x MI100 node with OpenMPI, GPU-awareness improves performance by ~10% compared to host-pinned buffers. I don't have access to a multi-node AMD system to test the multi-node case. GPU-aware performance does not appear to be affected by the GPU binding.
@WeiqunZhang Does lowering the value of
On 64 nodes on NCSA Delta, I get a 13% performance improvement with CUDA-aware MPI over host-pinned buffers on a hydro test problem. This is with OpenMPI+UCX for now. It will be interesting to see whether Cray MPI performance is significantly different.
Which NCSA Delta partition was this? The 4-GPU A100 nodes, or a different partition?
This was the A100x4 partition.
👍
It has 1 NIC per node. I'll check whether it's Slingshot 10 or SS11, not sure about that offhand.
It's currently Slingshot 10.
Makes sense, thanks! So it sounds like the strongest possibilities are either affinity differences, or that the OpenMPI+UCX implementation of CUDA-aware MPI is better. It would be really good to lock down the causes so AMReX can make some informed decisions, and we could pass this along to the NERSC and/or Illinois teams to start some adjustments and discussions. OpenMPI+UCX currently doesn't exist on Perlmutter. Is there an MPICH implementation on NCSA Delta? One other general thing: we should probably make sure we're testing comms with
No, not at the moment.
For my case, I've been comparing the total cell updates for a hydro test problem, rather than looking at the comm time itself.
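For looking at comm time directly rather than inferring it from cell-update rates, one minimal approach is to bracket just the exchange with barriers and `MPI_Wtime`. A generic sketch; the callable `f` stands in for whatever halo exchange is being measured:

```cpp
// Sketch: measure only the communication phase of some exchange `f`.
// The barriers keep load imbalance from leaking into the timing, which is
// loosely the same idea as enabling amrex.use_profiler_syncs=1.
#include <mpi.h>

template <typename F>
double timed_exchange (F&& f)
{
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    f();   // e.g., a FillBoundary/ParallelCopy-style halo update
    MPI_Barrier(MPI_COMM_WORLD);
    return MPI_Wtime() - t0;
}
```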
For the MPICH: yeah, that tracks: two systems, each with a different, unique MPI implementation, getting different results. Couldn't be easy, could it? 😄 Thanks for all your work on this, Ben!
I've obtained access and run tests on Crusher. I can share those over email.
## Summary

This change, suggested by @WeiqunZhang, points `the_fa_arena` to `The_Device_Arena` when GPU-aware MPI is activated. This obviates the need to set `the_arena_is_managed=0` to take advantage of GPU-aware MPI, since GPU-aware MPI does not work well with managed memory.

## Additional background

This was a long-pending change; the immediate trigger was finding that GPU-aware MPI can reduce communication times significantly, but currently only when `the_arena_is_managed=0` is set. Not setting this with GPU-aware MPI currently results in degraded performance.

Past discussion on GPU-aware MPI: #2967

## Preliminary performance test

Running 100 steps on 8 GPUs over 2 Perlmutter A100 nodes with `Tests/GPU/CNS/Exec/Sod`, `amr.n_cell = 128^3` per GPU, `amr.max_grid_size = 128`, `amrex.use_profiler_syncs = 1`, and optimal GPU affinities.

### Without `amrex.use_gpu_aware_mpi=1`

```
FabArray::ParallelCopy_nowait()    200   0.133     0.1779    0.2067   17.82%
FabArray::ParallelCopy_finish()    200   0.07822   0.1193    0.1786   15.40%
```

### With `amrex.use_gpu_aware_mpi=1`

```
FabArray::ParallelCopy_nowait()    200   0.05655   0.07633   0.1034   11.20%
FabArray::ParallelCopy_finish()    200   0.03969   0.06087   0.09024   9.77%
```

Co-authored-by: Mukul Dave <[email protected]>
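For context, the gist of the change can be sketched as choosing which arena backs MPI communication buffers based on whether GPU-aware MPI is active. This is a sketch of the idea only, not AMReX's actual implementation; `choose_comm_arena` and the boolean flag are made up for illustration.

```cpp
// Sketch of the idea only, not AMReX's implementation: pick the arena that
// backs MPI communication buffers. With GPU-aware MPI, device pointers can be
// handed straight to MPI; otherwise data is staged through pinned host memory.
#include <AMReX_Arena.H>

amrex::Arena* choose_comm_arena (bool use_gpu_aware_mpi)
{
    return use_gpu_aware_mpi ? amrex::The_Device_Arena()
                             : amrex::The_Pinned_Arena();
}
```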
At least for current-generation GPU systems, it appears the root cause of the issue is cgroup isolation of GPUs on the same node preventing the use of CUDA/ROCm IPC: open-mpi/ompi#11949. @WeiqunZhang Since there is the communication arena now, does it make sense to enable GPU-aware MPI by default in AMReX?
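One rough way to see whether same-node IPC is even possible between the GPUs a process can see is a peer-access check. A CUDA-specific sketch (ROCm has the analogous `hipDeviceCanAccessPeer`); the function name is made up for illustration:

```cpp
// Sketch: check whether device-to-device peer access (the mechanism behind
// CUDA IPC transports such as UCX's cuda_ipc) is possible between the GPUs
// visible to this process. Under cgroup isolation each rank may see only one
// device, in which case there is nothing to check and IPC cannot be used.
#include <cuda_runtime.h>
#include <cstdio>

void report_peer_access ()
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            std::printf("peer access %d -> %d: %s\n", i, j, can ? "yes" : "no");
        }
    }
}
```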
This issue is more complex than whether or not IPC is available, since a large number of cases relevant to AMReX users are going to be on large (100s to 1000s of servers) configurations, so it's a balance of effects between IPC and RDMA. The default should be determined by looking at both small-scale (~10 servers) and large-scale (~100-1000 servers) benchmark runs on the major GPU systems we care about and determining which is better on average.
Maybe we can add a build option that allows people to change the default at build time. |
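One possible shape for such a switch; this is purely illustrative, and the macro name used here is hypothetical, not an existing AMReX build option:

```cpp
// Illustrative only: a compile-time definition sets the default, while the
// existing runtime parameter (amrex.use_gpu_aware_mpi in the inputs file)
// would still override it.
#ifndef AMREX_GPU_AWARE_MPI_DEFAULT   // hypothetical macro, not a real AMReX flag
#define AMREX_GPU_AWARE_MPI_DEFAULT 0
#endif

static bool use_gpu_aware_mpi = (AMREX_GPU_AWARE_MPI_DEFAULT != 0);
```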
This turns on GPU-aware MPI by default. On all current machines, simulations run faster with GPU-aware MPI enabled. Two technical issues that prevented this are now resolved: AMReX now has the communication arena, which does not use managed memory, and SLURM no longer uses cgroup isolation for GPU bindings by default. Closes AMReX-Codes#2967.
## Summary

This turns on GPU-aware MPI by default.

## Additional background

On all current machines, simulations run faster with GPU-aware MPI enabled. Two technical issues that prevented this are now resolved: AMReX now has the communication arena, which does not use managed memory, and SLURM no longer uses cgroup isolation for GPU bindings by default.

Closes #2967.

---------

Co-authored-by: Weiqun Zhang <[email protected]>
GPU-aware MPI significantly improves performance over the default host-pinned buffers in AMReX if two conditions are satisfied:

1. `CUDA_VISIBLE_DEVICES` is not set (e.g., when using SLURM, `--gpu-bind=none`).
2. Managed memory is disabled (`amrex.the_arena_is_managed=0`).

If both of these are satisfied, then OpenMPI (at least) uses CUDA IPC (via the UCX `cuda_ipc` transport) to perform device-to-device copies over NVLink. In my tests, this is significantly faster than using AMReX's host-pinned buffers on A100 + NVLink systems, leading to on-node scaling that is essentially perfect (95-99% scaling efficiency).

Scaling tests: quokka-astro/quokka#121
OpenMPI issue: open-mpi/ompi#10871
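For completeness, a minimal example of what GPU-aware MPI looks like from the application side: device pointers handed directly to MPI. This is a sketch under stated assumptions (a CUDA-aware MPI build, one GPU per rank, an even number of ranks), not AMReX code.

```cpp
// Sketch: pass device pointers directly to MPI. With a CUDA-aware Open MPI +
// UCX build, an intra-node exchange like this can use the cuda_ipc transport
// (NVLink) instead of staging through host-pinned buffers.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

int main (int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int n = 1 << 20;                       // ~8 MB of doubles per message
    double *d_send = nullptr, *d_recv = nullptr;
    cudaMalloc(&d_send, n * sizeof(double));
    cudaMalloc(&d_recv, n * sizeof(double));
    std::vector<double> h_init(n, static_cast<double>(rank));
    cudaMemcpy(d_send, h_init.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    const int peer = rank ^ 1;                   // pair up neighboring ranks
    if (peer < nranks) {
        MPI_Sendrecv(d_send, n, MPI_DOUBLE, peer, 0,
                     d_recv, n, MPI_DOUBLE, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```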