[QST]Running cuML with NCCL and MPI #2351

Closed
aj-prime opened this issue Jun 1, 2020 · 2 comments
Labels
? - Needs Triage Need team to review and classify question Further information is requested

aj-prime commented Jun 1, 2020

I was wondering how I can run cuML with NCCL and MPI.

I am able to run a distributed sample code (nearest neighbor given as an example) on two workers.
It looks like this code is using point-to-point (pt2pt) UCX communication. Is there any algorithm that uses NCCL collectives?


cjnolet commented Jun 1, 2020

Hi @aj-prime,

All of the distributed algorithms that use the C++ communications layer API currently use NCCL. For example, the distributed nearest neighbors implementation uses it to broadcast batches to other ranks. UCX is used there to gather the results of each batch, since the NCCL public API doesn't have a gather.
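To make the broadcast-then-gather pattern concrete, here is a minimal single-process sketch in plain Python. It is not cuML's implementation: the ranks are simulated as list entries, the data is one-dimensional, and the names `local_topk` and `distributed_knn` are hypothetical. The comments mark which step would be a collective (NCCL broadcast) and which would be point-to-point (UCX) in the setup described above.

```python
import heapq


def local_topk(shard, query, k):
    """Return the k nearest (distance, global_index) pairs from one rank's shard."""
    dists = [(abs(value - query), idx) for idx, value in shard]
    return heapq.nsmallest(k, dists)


def distributed_knn(shards, queries, k, batch_size):
    """Simulate a batched k-NN query over index shards held by different ranks.

    shards:  one list of (global_index, value) pairs per simulated rank
    queries: list of query values owned by the root
    """
    root_results = []
    for start in range(0, len(queries), batch_size):
        batch = queries[start:start + batch_size]

        # Collective step (a broadcast): every rank receives the same batch.
        received = [list(batch) for _ in shards]

        # Each rank searches only its own shard.
        partials = [
            [local_topk(shard, q, k) for q in rank_batch]
            for shard, rank_batch in zip(shards, received)
        ]

        # Point-to-point step: each rank sends its partial results to the
        # root, which merges them into the global top-k for each query.
        for qi in range(len(batch)):
            candidates = [pair for part in partials for pair in part[qi]]
            merged = heapq.nsmallest(k, candidates)
            root_results.append([idx for _, idx in merged])
    return root_results


if __name__ == "__main__":
    shards = [[(0, 0.0), (1, 10.0)], [(2, 1.0), (3, 11.0)]]
    print(distributed_knn(shards, [0.4, 10.6], k=2, batch_size=1))
```

Merging per-rank top-k lists at the root is enough to get the exact global top-k, because any global neighbor must also be among the k nearest within its own shard.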

Do you want to run the cuML distributed algorithms directly in MPI, without going through Python / Dask? We have a couple of examples that can be executed using the MPI/NCCL backend, which uses MPI to bootstrap a NCCL clique and then uses NCCL for collectives. If you use a CUDA-aware MPI implementation that was built with UCX support, you'll get the equivalent of our NCCL/UCX backend.
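The bootstrap step can be pictured as a three-step handshake. With real NCCL and MPI it is typically `ncclGetUniqueId` on one rank, an `MPI_Bcast` of that ID, and `ncclCommInitRank` on every rank. The sketch below only simulates that flow in plain Python so it runs without GPUs or MPI; `CliqueMember`, `oob_broadcast`, and `bootstrap_clique` are hypothetical names, and a UUID stands in for the NCCL unique ID.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class CliqueMember:
    """Stand-in for one rank's communicator after joining the clique."""
    unique_id: str
    world_size: int
    rank: int


def oob_broadcast(value_on_root, world_size):
    """Stand-in for the out-of-band broadcast: every rank gets the root's value."""
    return [value_on_root for _ in range(world_size)]


def bootstrap_clique(world_size):
    # Step 1: only the root rank creates the unique ID.
    root_id = uuid.uuid4().hex

    # Step 2: the ID travels out-of-band to every rank (the role MPI plays).
    ids = oob_broadcast(root_id, world_size)

    # Step 3: each rank joins the clique with (ID, world size, own rank).
    return [CliqueMember(ids[rank], world_size, rank) for rank in range(world_size)]


if __name__ == "__main__":
    for member in bootstrap_clique(4):
        print(member)
```

After this handshake, MPI is no longer needed for the collectives themselves; every rank holds a communicator tied to the same clique ID.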

You should be able to run the examples above in MPI by creating a cumlHandle and injecting an instance of the MPI communicator into it. Finally, if you are interested in using this from the Python layer and Dask, this pytest might be a good place to start.

Let us know if you encounter any issues. It's also worth mentioning that our communications layer is in the beginning stages of being transferred to a new project named raft.


cjnolet commented Jun 3, 2020

@aj-prime, assuming the above answers your question, I'm going to close this for now. Please feel free to reopen it if you have further questions.
