Building cuda aware openMPI does not seem to work #12334

Closed
PhilipDeegan opened this issue Feb 14, 2024 · 10 comments

Comments

@PhilipDeegan

with version 5.0.2 from https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.2.tar.bz2
UCX v1.15.0 from https://github.com/openucx/ucx

I followed the guide at https://www.open-mpi.org/faq/?category=buildcuda
but without gdrcopy.

The build always reports CUDA support as false, yet the list of MPI extensions includes cuda:

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:false
ompi_info | grep 'MPI ext'
          MPI extensions: affinity, cuda, ftmpi, rocm
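
For anyone repeating this check in a build script, here is a minimal sketch of a helper that pulls just the value out of the parsable output. The function name `cuda_support_value` is made up for illustration and is not part of Open MPI; it only assumes the `mpi_built_with_cuda_support:value:` line format shown above.

```shell
# Print the value of mpi_built_with_cuda_support from parsable ompi_info
# output read on stdin. Prints "unknown" if the parameter is not present.
# (cuda_support_value is a hypothetical helper name, not an Open MPI tool.)
cuda_support_value() {
    awk -F: '/mpi_built_with_cuda_support:value:/ { v = $NF; found = 1 }
             END { print (found ? v : "unknown") }'
}
```

It would be used as `ompi_info --parsable --all | cuda_support_value`, which should print `true` or `false` depending on the build.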
@PhilipDeegan
Author

Adding --with-cuda-libdir=/usr/local/cuda/lib64 to ./configure doesn't help either.

This flag is not mentioned in the doc at https://www.open-mpi.org/faq/?category=buildcuda

but according to #12264 it is supposedly necessary.

@jsquyres
Member

Check out https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html#how-do-i-build-open-mpi-with-cuda-aware-support. These docs aren't the greatest, but I think they should solve your issue. FWIW, @janjust is looking at improving the situation so that --with-cuda-libdir isn't necessary.

@PhilipDeegan
Author

in this linked article there seems to be some conflicts

11.2.6.2. How do I verify that Open MPI has been built with CUDA support? (https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html#how-do-i-verify-that-open-mpi-has-been-built-with-cuda-support)

and

11.2.6.8. How can I tell if Open MPI was built with CUDA support? (https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html#how-can-i-tell-if-open-mpi-was-built-with-cuda-support)

show different methods. The first matches what I see with my build, but the second does not, so I will just test the binaries to see whether it works.

@jsquyres
Member

in this linked article there seems to be some conflicts

@janjust @hppritcha FYI -- might want to make these docs better / more clear.

@wenduwan
Contributor

Related #12137

@PhilipDeegan
Author

PhilipDeegan commented Feb 20, 2024

Testing the attempted CUDA-aware libmpi, I get a segfault:

[8177e2b19878:385472:0:385472] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd85f200000)
==== backtrace (tid: 385472) ====
 0  /opt/mpi/cuda/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7fd8965c8d84]
 1  /opt/mpi/cuda/lib/libucs.so.0(+0x33f7f) [0x7fd8965c8f7f]
 2  /opt/mpi/cuda/lib/libucs.so.0(+0x34266) [0x7fd8965c9266]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fd896db3520]
 4  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x1a0941) [0x7fd896f11941]
 5  /opt/mpi/cuda/lib/libopen-pal.so.80(+0x6b005) [0x7fd8967dd005]
 6  /opt/mpi/cuda/lib/libmpi.so.40(ompi_datatype_sndrcv+0x50a) [0x7fd897355dca]
 7  /opt/mpi/cuda/lib/libmpi.so.40(PMPI_Allgather+0x11e) [0x7fd897356e0e]

This same code does indeed work for the OpenMPI ROCM version

build script here https://github.com/PHARCHIVE/phare-mpi/blob/ompi/phare-mpi-cuda/build.sh

/opt/mpi/cuda/bin/ompi_info | grep ext
          MPI extensions: affinity, cuda, ftmpi, rocm

@PhilipDeegan
Author

ok I got it working

it appears that the --with-cuda-libdir=...

must be --with-cuda-libdir=/usr/local/cuda/lib64/stubs

and not --with-cuda-libdir=/usr/local/cuda/lib64
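
A possible explanation, which is an assumption and not confirmed in this thread, is that configure needs to find libcuda.so, and the CUDA toolkit only ships that file as a stub under lib64/stubs, while the real driver library is installed elsewhere by the driver package. Under that assumption, a small sketch for picking the directory to pass to configure could look like this (the helper name `find_libcuda_dir` is made up for illustration):

```shell
# Print the first candidate directory that contains libcuda.so.
# Candidates are checked in the order given; returns 1 if none match.
# (find_libcuda_dir is a hypothetical helper, not part of Open MPI or CUDA.)
find_libcuda_dir() {
    for dir in "$@"; do
        if [ -e "$dir/libcuda.so" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Example use (paths assume a default /usr/local/cuda install):
#   libdir=$(find_libcuda_dir /usr/local/cuda/lib64 /usr/local/cuda/lib64/stubs)
#   ./configure --with-cuda=/usr/local/cuda --with-cuda-libdir="$libdir"
```

On a machine where libcuda.so exists only in the stubs directory, this would select /usr/local/cuda/lib64/stubs, which matches what worked above.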

@shoveller86

it appears that the --with-cuda-libdir=...

must be --with-cuda-libdir=/usr/local/cuda/lib64/stubs

Thanks. I passed the flag --with-cuda-libdir=/usr/local/cuda/lib64/stubs when running configure.

After that, the macro MPIX_CUDA_AWARE_SUPPORT is set to 1 in the file openmpi/mpiext/mpiext_cuda_c.h:

#define MPIX_CUDA_AWARE_SUPPORT 1
OMPI_DECLSPEC int MPIX_Query_cuda_support(void);

The implementation (in C):

#include "ompi_config.h"

#include <stdio.h>
#include <string.h>

#include "opal/mca/accelerator/base/base.h"
#include "ompi/mpiext/cuda/c/mpiext_cuda_c.h"

/* If CUDA-aware support is configured in, return 1. Otherwise, return 0.
 * This API may be extended to return more features in the future. */
int MPIX_Query_cuda_support(void)
{
    return 0 == strcmp(opal_accelerator_base_selected_component.base_version.mca_component_name, "cuda");
}

But when I run the check command: ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
it still reports that CUDA awareness is missing: mca:mpi:base:param:mpi_built_with_cuda_support:value:false
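
To compare the two signals side by side, the compile-time macro can be read straight from the installed header. This is only a sketch: `cuda_aware_macro` is a made-up helper name, and the header path in the usage note assumes the /opt/mpi/cuda prefix seen earlier in this thread. Note that, per the implementation above, MPIX_Query_cuda_support() depends on which accelerator component is selected at run time, so it may disagree with the macro.

```shell
# Print the value of MPIX_CUDA_AWARE_SUPPORT as defined in the given header,
# or "undefined" if the header has no such #define.
# (cuda_aware_macro is a hypothetical helper, not part of Open MPI.)
cuda_aware_macro() {
    awk '$1 == "#define" && $2 == "MPIX_CUDA_AWARE_SUPPORT" { print $3; found = 1; exit }
         END { if (!found) print "undefined" }' "$1"
}
```

For example, `cuda_aware_macro /opt/mpi/cuda/include/openmpi/mpiext/mpiext_cuda_c.h` would print the macro value for that install, which can then be compared with the ompi_info output.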

@PhilipDeegan
Author

@ggouaillardet
Contributor

@shoveller86 what do you get if you run ompi_info --all | grep cuda_support?

can you also compress and share your config.log?
