Add device_send, device_recv, device_sendrecv, device_multicast_sendrecv #144
Conversation
seunghwak commented on Feb 9, 2021
- Undo temporarily exposing a RAFT communication object's private NCCL communicator.
- Add device_send/device_recv (if only sending or only receiving), device_sendrecv (if sending and receiving), and device_multicast_sendrecv (if sending and receiving multiple messages); a usage sketch follows this list.
- Add test suites for newly added raft::comms_t routines.
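To make the new API concrete, here is a minimal, hedged usage sketch. The exact member signatures, the header path, and the ring/even-odd peer choices are assumptions based on the PR description and the partial diff shown later in this thread, not a verbatim copy of the merged code.

```cpp
#include <cuda_runtime.h>
#include <raft/comms/comms.hpp>  // assumed header path; may differ across RAFT versions

// Hedged sketch of the point-to-point routines added in this PR.
void ring_exchange(raft::comms::comms_t const& comm, cudaStream_t stream,
                   float const* d_send, float* d_recv, size_t n /* element count */)
{
  int const rank  = comm.get_rank();
  int const size  = comm.get_size();
  int const right = (rank + 1) % size;         // hypothetical ring neighbors
  int const left  = (rank + size - 1) % size;

  // Send to the right neighbor and receive from the left one in a single call;
  // pairing every send with a matching receive avoids deadlock.
  comm.device_sendrecv(d_send, n, right, d_recv, n, left, stream);

  // One-directional variant: even ranks send, odd ranks receive
  // (assumes an even number of ranks so every send has a matching receive).
  if (rank % 2 == 0) {
    comm.device_send(d_send, n, right, stream);
  } else {
    comm.device_recv(d_recv, n, left, stream);
  }
}
```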
}

/**
 * Performs a multicast send/receive
Just curious, is there an MPI equivalent for that?
MPI has all-to-all, but AFAIK there is no multicast that receives data from a subset of the ranks and sends data to another subset.
This multicast is actually just ncclSend/ncclRecv operations placed inside ncclGroupStart() & ncclGroupEnd().
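For illustration, a minimal raw-NCCL sketch of that pattern, assuming one rank per GPU and NCCL >= 2.7 (which introduced ncclSend/ncclRecv). Each rank sends to an arbitrary subset of peers (dests) and receives from another subset (sources); the parameter names mirror those in the diff context below, but the function itself is illustrative rather than the PR's actual implementation.

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

// Sketch: multicast-style exchange with raw NCCL (names are illustrative).
void multicast_sendrecv(ncclComm_t comm, cudaStream_t stream,
                        float const* d_sendbuf, std::vector<size_t> const& sendsizes,
                        std::vector<size_t> const& sendoffsets, std::vector<int> const& dests,
                        float* d_recvbuf, std::vector<size_t> const& recvsizes,
                        std::vector<size_t> const& recvoffsets, std::vector<int> const& sources)
{
  // All sends/receives posted between ncclGroupStart/ncclGroupEnd are matched
  // and issued as one fused operation on `stream`, avoiding pairwise deadlock.
  ncclGroupStart();
  for (size_t i = 0; i < sendsizes.size(); ++i) {
    ncclSend(d_sendbuf + sendoffsets[i], sendsizes[i], ncclFloat32, dests[i], comm, stream);
  }
  for (size_t i = 0; i < recvsizes.size(); ++i) {
    ncclRecv(d_recvbuf + recvoffsets[i], recvsizes[i], ncclFloat32, sources[i], comm, stream);
  }
  ncclGroupEnd();
}
```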
  std::vector<size_t> const &recvoffsets,
  std::vector<int> const &sources,
  cudaStream_t stream) const {
  // ncclSend/ncclRecv pair needs to be inside ncclGroupStart/ncclGroupEnd to avoid deadlock
In practice, do these transfers get serialized on the same stream? The API doesn't seem to allow the backend to run on multiple streams.
I'm thinking about the case where sendsizes.size() is large but the individual sendsizes[i] are relatively small; it would be interesting to see whether this benefits from concurrency.
All the send/receive operations are placed inside ncclGroupStart() and ncclGroupEnd(), so AFAIK they all execute concurrently after ncclGroupEnd() (at least logically; NCCL may restrict parallelism internally to avoid congestion, depending on the interconnect).
If you are worried about the time to queue the ncclSend/ncclRecv operations (the cost of the for loops), that only becomes problematic if sendsizes.size() or recvsizes.size() gets very large, e.g. millions. I am assuming sendsizes.size() <= # of GPUs, and the # of GPUs is unlikely to grow that large (it would be great if our code scaled to millions of GPUs everywhere else and this became the bottleneck, but I don't expect that in the foreseeable future).
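If one wanted to check empirically whether the grouped path benefits from concurrency, a hedged sketch of a timing harness follows: CUDA events around the call, with the device_multicast_sendrecv signature assumed from the partial diff above and the buffers/peer lists supplied by the caller.

```cpp
#include <cuda_runtime.h>
#include <raft/comms/comms.hpp>  // assumed header path; may differ across RAFT versions
#include <vector>

// Hypothetical harness: time a many-small-message multicast issued on one stream.
float time_multicast(raft::comms::comms_t const& comm, cudaStream_t stream,
                     float const* d_sendbuf, std::vector<size_t> const& sendsizes,
                     std::vector<size_t> const& sendoffsets, std::vector<int> const& dests,
                     float* d_recvbuf, std::vector<size_t> const& recvsizes,
                     std::vector<size_t> const& recvoffsets, std::vector<int> const& sources)
{
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, stream);
  comm.device_multicast_sendrecv(d_sendbuf, sendsizes, sendoffsets, dests,
                                 d_recvbuf, recvsizes, recvoffsets, sources, stream);
  cudaEventRecord(stop, stream);
  cudaEventSynchronize(stop);  // the grouped transfers are stream-ordered, so this waits for them

  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}
```

Comparing this number against issuing the same messages one pair at a time (e.g. a loop of individual device_sendrecv calls) would show how much the grouping helps when sendsizes.size() is large but the individual messages are small.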
@gpucibot merge
…icast_sendrecv, gather, gatherv) (#1391)
- [x] Update cuGraph to use RAFT::comms_t's newly added device_sendrecv & device_multicast_sendrecv
- [x] Update cuGraph to use RAFT::comms_t's newly added gather & gatherv
- [x] Update RAFT git tag once rapidsai/raft#114 (currently merged in 0.18 but is not merged to 0.19) and rapidsai/raft#144 are merged to 0.19

Ready for review but cannot be merged till RAFT PR 114 and 144 are merged to RAFT branch-0.19.

Authors:
- Seunghwa Kang (@seunghwak)

Approvers:
- Alex Fender (@afender)

URL: #1391