[REVIEW] commSplit Implementation #18
Conversation
cpp/include/raft/comms/std_comms.hpp (outdated):

```cpp
// non-root ranks of new comm recv unique id
irecv(&id.internal, 128, root, color, requests.data() + request_idx);
```
Another hard-coded `128` here.
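A minimal sketch of the kind of fix this comment is asking for, assuming the buffer is NCCL's unique id (nccl.h defines `internal` as `char[NCCL_UNIQUE_ID_BYTES]`, which is 128): derive the size from the type instead of repeating the literal.

```cpp
// Minimal sketch (assumes nccl.h is available): derive the unique-id size
// from the type instead of hard-coding 128.
#include <cstdio>
#include <nccl.h>

int main() {
  ncclUniqueId id;
  // NCCL defines internal as char[NCCL_UNIQUE_ID_BYTES] (128 today); using
  // sizeof keeps the irecv call correct even if that ever changes, e.g.
  //   irecv(&id.internal, sizeof(id.internal), root, color,
  //         requests.data() + request_idx);
  static_assert(sizeof(id.internal) == NCCL_UNIQUE_ID_BYTES,
                "unique id size mismatch");
  std::printf("unique id bytes: %zu\n", sizeof(id.internal));
  return 0;
}
```

The same substitution would presumably apply to the matching send on the root side, so both ends stay in sync.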
@BradReesWork this PR led to some good discussions on Slack, but unfortunately it is blocked until we reach some sort of agreement on the path forward.
We'd better start more in-depth discussions this coming Monday, but I think we can start thinking about handle use cases (in particular, with sub-communicators). There are a few questions we can try to answer first. We can mainly think of two different cases:
If we're running MG primitives or large SG primitives (large enough to use up all the single-GPU resources), I think we just need to consider case 1. If we're running multiple small SG primitives (too small for the capacity of a single GPU, so we need to run multiple primitives concurrently to fully use the GPU resources), we need multiple handles, but in this case there is no need for communicators. (There can be some pathological cases, e.g. two primitives where one requires very large memory but very little computing and the other requires very little memory but very large computing; in that case we can get some speedup running two MG primitives concurrently... but let's ignore such seemingly very rare cases.)

So, if we can assume there is only one handle with communicator(s) per process (though there can be multiple handles without any communicator), I think we can simplify our discussions. With only one such handle, cuGraph/cuML can create the necessary (sub-)communicators on initialization, and we may not need to worry much about creating an excessive number of sub-communicators; cuGraph will create at most two.

One side question: if we agree on one communicator handle per process, will there be one communicator handle per RAPIDS, or one per cuML and another per cuGraph (so actually two communicator handles per process)? I remember cuML is planning to create a child class of raft::handle_t, so we may not be able to do one communicator handle per RAPIDS.

Another side question: should we limit sub-communicators to collectives only? For P2P, a single global communicator will be sufficient. Under our current implementation, this means sub-communicator == wrapper for an NCCL sub-communicator (this may change if we adopt something like UCC, https://www.ucfconsortium.org/projects/ucc/, for host collectives; UCC is collectives on UCX). I will post more on Monday.
Regarding one handle per process: unfortunately, we currently make no such constraint in cuML, in either SG or MG. We currently create a new handle for each estimator, even in MG. It would definitely be nice to see more reuse of the handle across different estimator instances, though.

Regarding an NCCL-only sub-communicator: P2P is optional on the communicator, and when P2P is enabled the UCX endpoints are created once in UCX-py and reused. Do we gain anything by limiting the sub-communicator to NCCL-only if the global communicator was already initialized with P2P? The endpoints are just pointers that get pushed down to the new communicator. If the concern here is about performance, for example from creating the array with the subset of UCX endpoints, another option might be a setting on communicator creation that enables or disables the UCX endpoints from being injected into the sub-communicator (and asserts when P2P methods are called). This seems like the most straightforward solution to me: it maintains the same interface and doesn't increase the confusion factor from unexpected behavior (compared to MPI). What do you think?
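A rough sketch of that flag idea, with every name hypothetical rather than raft's actual API: the split communicator keeps the same interface, collectives always work, and P2P calls fail loudly when the UCX endpoints were not injected at creation time.

```cpp
// Hypothetical sketch of the proposed flag; none of these names are raft's
// actual API. Collectives always work on the split communicator, while P2P
// throws if UCX endpoints were not injected when the split was created.
#include <stdexcept>

class split_comm_t {
 public:
  explicit split_comm_t(bool inject_ucx_endpoints)
    : p2p_enabled_(inject_ucx_endpoints) {}

  void allreduce(/* sendbuf, recvbuf, count, op, stream */) {
    // would invoke the NCCL collective on the sub-communicator
  }

  void isend(/* buf, size, dest, tag, request */) {
    if (!p2p_enabled_) {
      throw std::logic_error(
          "P2P called on a sub-communicator created without UCX endpoints; "
          "use the global communicator for point-to-point");
    }
    // would send via the UCX endpoints inherited from the global comm
  }

 private:
  bool p2p_enabled_;
};

int main() {
  split_comm_t collectives_only(/*inject_ucx_endpoints=*/false);
  collectives_only.allreduce();  // fine: collectives are always available
  try {
    collectives_only.isend();    // throws: endpoints were not injected
  } catch (const std::logic_error&) {
  }
  return 0;
}
```

This keeps the interface uniform while making the collectives-only configuration explicit, matching the assert-on-call behavior suggested above.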
If there is no fundamental reason we cannot adopt the one-full-handle (handle with communicator) per process approach, we can design sub-communicator support assuming this. If I am not mistaken, this will not affect the current cuML pipelines, as cuML is not using sub-communicators. For cuGraph (which already adopts the one-full-handle-per-process approach), if we have an option to add sub-communicators to the full handle, we can work on 2D partitioning for the coming releases.
I don't know :-) This is mainly about simplicity. I'm not sure this really applies to our specific case, but in general, sharing increases software complexity, and I guess sharing endpoints between the full communicator and sub-communicators increases complexity. While there is no harm in adding P2P support for sub-communicators, there is no need, and if there is no need, I am inclined toward the simpler approach. I assume a sub-communicator supporting only collectives will be simpler than one supporting both collectives and P2P (with P2P on a sub-communicator, should we use the global rank or the rank within the sub-communicator? Is any synchronization necessary if multiple communicators use the same endpoints? We don't need to think about any of this if sub-communicators are collectives-only, since we can do all the necessary P2P communication using the global communicator).
This is mainly about simplicity. If we don't need P2P with sub-communicators, why should we support it? I will dig into the interface part, but if what you have suggested leads to a simpler design, we should go for that. I just don't want to add software development/maintenance burden for unnecessary features. cuGraph just needs a means to add sub-communicators for collectives and use them when necessary, under the one-full-handle-per-process approach.
Related to https://github.com/rapidsai/raft/pull/18/files#diff-0835dc79c2c9697632a30a73323b374fR218
Yeah... I think we can go with 1) the current implementation, 2) skipping P2P for sub-communicators, or 3) adding a flag to include/exclude P2P support. All three sound like legitimate choices at this moment, and we can add something like 3) later if needed. Any concerns or any other issues to be discussed? cuGraph needs this to implement 2D partitioning of the graph adjacency matrix; let me know if there is anything else I can do to move this PR forward.
Conflicts:
	CHANGELOG.md
	cpp/include/raft/comms/std_comms.hpp
@seunghwak, this should be ready for another (hopefully final) review. I've added a new argument. I like your idea regarding the addition of a hashmap to the handle to store instances of sub-communicators. What do you think of doing that in a follow-on PR, since this PR has already grown quite large?
Sure, we can do this in an additional PR (and I think that's actually better, as it is beyond the commSplit implementation).
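For reference, a toy sketch of what that follow-on hashmap could look like; every name here is hypothetical rather than raft's actual handle API. The handle would own the global communicator plus a map of named sub-communicators created once at initialization, e.g. cuGraph's row/column sub-communicators for 2D partitioning.

```cpp
// Toy sketch of the hashmap idea deferred to a follow-on PR; all names are
// hypothetical, not raft's actual handle_t API.
#include <cstdio>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>

struct comms_t {  // stand-in for the communicator wrapper
  int size;
  int rank;
};

class handle_t {
 public:
  void set_subcomm(std::string key, std::shared_ptr<comms_t> subcomm) {
    subcomms_[std::move(key)] = std::move(subcomm);
  }
  const comms_t& get_subcomm(const std::string& key) const {
    auto it = subcomms_.find(key);
    if (it == subcomms_.end()) throw std::out_of_range("no such subcomm: " + key);
    return *it->second;
  }

 private:
  std::unordered_map<std::string, std::shared_ptr<comms_t>> subcomms_;
};

int main() {
  handle_t handle;
  // Under the one-full-handle-per-process model, cuGraph would register its
  // row/column sub-communicators once, at initialization.
  handle.set_subcomm("row", std::make_shared<comms_t>(comms_t{2, 0}));
  handle.set_subcomm("col", std::make_shared<comms_t>(comms_t{2, 1}));
  const comms_t& row = handle.get_subcomm("row");
  std::printf("row subcomm: size=%d rank=%d\n", row.size, row.rank);
  return 0;
}
```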
Thank you very much for taking care of this issue, this will be really helpful!!!
If @afender is happy with these changes, then I think they are good to go. I'll try and create a PR for the …
A simple commSplit implementation with an associated end-to-end pytest that creates sub-communicators and performs separate allreduces on them.
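For readers unfamiliar with split semantics, below is a small self-contained model of what commSplit does, mirroring MPI_Comm_split: ranks with the same color land in the same sub-communicator, and key orders the ranks within it. This models only the rank bookkeeping; the actual PR implements the split over NCCL/UCX in std_comms.hpp.

```cpp
// Self-contained model of commSplit's color/key semantics (as in
// MPI_Comm_split); bookkeeping only, no actual NCCL/UCX communication.
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// For each global rank, returns (color = subcomm id, rank within subcomm).
std::vector<std::pair<int, int>> comm_split_model(const std::vector<int>& color,
                                                  const std::vector<int>& key) {
  const int n = static_cast<int>(color.size());
  std::vector<int> order(n);
  for (int r = 0; r < n; ++r) order[r] = r;
  // group by color; within a color, order by key (ties broken by old rank)
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    if (color[a] != color[b]) return color[a] < color[b];
    if (key[a] != key[b]) return key[a] < key[b];
    return a < b;
  });
  std::vector<std::pair<int, int>> out(n);
  int new_rank = 0;
  for (int i = 0; i < n; ++i) {
    if (i > 0 && color[order[i]] != color[order[i - 1]]) new_rank = 0;
    out[order[i]] = {color[order[i]], new_rank++};
  }
  return out;
}

int main() {
  // 4 ranks split into two sub-communicators: even ranks vs odd ranks
  std::vector<int> color{0, 1, 0, 1}, key{0, 0, 1, 1};
  auto assign = comm_split_model(color, key);
  for (int r = 0; r < 4; ++r)
    std::printf("rank %d -> subcomm %d, new rank %d\n",
                r, assign[r].first, assign[r].second);
  return 0;
}
```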