[REVIEW] commSplit Implementation #18
Conversation
cpp/include/raft/comms/std_comms.hpp (outdated):

```cpp
// non-root ranks of new comm recv unique id
irecv(&id.internal, 128, root, color, requests.data() + request_idx);
```
Another hard-coded `128` here.
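A minimal sketch of the kind of fix this comment is asking for, assuming the buffer is NCCL's unique id (nccl.h defines `internal` as `char[NCCL_UNIQUE_ID_BYTES]`, which is 128): derive the size from the type instead of repeating the literal.

```cpp
// Minimal sketch (assumes nccl.h is available): derive the unique-id size
// from the type instead of hard-coding 128.
#include <cstdio>
#include <nccl.h>

int main() {
  ncclUniqueId id;
  // NCCL defines internal as char[NCCL_UNIQUE_ID_BYTES] (128 today); using
  // sizeof keeps the irecv call correct even if that ever changes, e.g.
  //   irecv(&id.internal, sizeof(id.internal), root, color,
  //         requests.data() + request_idx);
  static_assert(sizeof(id.internal) == NCCL_UNIQUE_ID_BYTES,
                "unique id size mismatch");
  std::printf("unique id bytes: %zu\n", sizeof(id.internal));
  return 0;
}
```

The same substitution would presumably apply to the matching send on the root side, so both ends stay in sync.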
@BradReesWork this PR led to some good discussions on Slack, but unfortunately it is blocked until we reach some sort of agreement on the path forward.
We'd better start more in-depth discussions this coming Monday, but I think we can start thinking about handle use cases (in particular, with sub-communicators). There are a few questions we can try to answer first. We can mainly think of two different cases:
If we're running MG primitives or large SG primitives (large enough to use up all the single-GPU resources), I think we just need to consider case 1. If we're running multiple small SG primitives (too small for the capacity of a single GPU, so we need to run multiple primitives concurrently to fully use the GPU resources), we need multiple handles, but in this case there is no need for communicators. (There can be some pathological cases, e.g. two primitives where one requires very large memory but very little computing and the other requires very little memory but very large computing; in that case we can get some speedup running two MG primitives concurrently... but let's ignore such seemingly very rare cases.)

So, if we can assume there is only one handle with communicator(s) per process (though there can be multiple handles without any communicator), I think we can simplify our discussions. With only one such handle, cuGraph/cuML can create the necessary (sub-)communicators on initialization, and we may not need to worry much about creating an excessive number of sub-communicators; cuGraph will create at most two.

One side question: if we agree on one communicator handle per process, will there be one communicator handle per RAPIDS, or one per cuML and another per cuGraph (so actually two communicator handles per process)? I remember cuML is planning to create a child class of raft::handle_t, so we may not be able to do one communicator handle per RAPIDS.

Another side question: should we limit sub-communicators to collectives only? For P2P, a single global communicator will be sufficient. Under our current implementation, this means sub-communicator == wrapper for an NCCL sub-communicator (this may change if we adopt something like UCC, https://www.ucfconsortium.org/projects/ucc/, for host collectives; UCC is collectives on UCX). I will post more on Monday.
Regarding one handle per process: unfortunately, we currently make no such constraint in cuML, in either SG or MG. We currently create a new handle for each estimator, even in MG. It would definitely be nice to see more reuse of the handle across different estimator instances, though.

Regarding an NCCL-only sub-communicator: P2P is optional on the communicator, and when P2P is enabled the UCX endpoints are created once in UCX-py and reused. Do we gain anything by limiting the sub-communicator to NCCL-only if the global communicator was already initialized with P2P? The endpoints are just pointers that get pushed down to the new communicator. If the concern here is about performance, for example from creating the array with the subset of UCX endpoints, another option might be a setting on communicator creation that enables or disables the UCX endpoints from being injected into the sub-communicator (and asserts when P2P methods are called). This seems like the most straightforward solution to me: it maintains the same interface and doesn't increase the confusion factor from unexpected behavior (compared to MPI). What do you think?
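A rough sketch of that flag idea, with every name hypothetical rather than raft's actual API: the split communicator keeps the same interface, collectives always work, and P2P calls fail loudly when the UCX endpoints were not injected at creation time.

```cpp
// Hypothetical sketch of the proposed flag; none of these names are raft's
// actual API. Collectives always work on the split communicator, while P2P
// throws if UCX endpoints were not injected when the split was created.
#include <stdexcept>

class split_comm_t {
 public:
  explicit split_comm_t(bool inject_ucx_endpoints)
    : p2p_enabled_(inject_ucx_endpoints) {}

  void allreduce(/* sendbuf, recvbuf, count, op, stream */) {
    // would invoke the NCCL collective on the sub-communicator
  }

  void isend(/* buf, size, dest, tag, request */) {
    if (!p2p_enabled_) {
      throw std::logic_error(
          "P2P called on a sub-communicator created without UCX endpoints; "
          "use the global communicator for point-to-point");
    }
    // would send via the UCX endpoints inherited from the global comm
  }

 private:
  bool p2p_enabled_;
};

int main() {
  split_comm_t collectives_only(/*inject_ucx_endpoints=*/false);
  collectives_only.allreduce();  // fine: collectives are always available
  try {
    collectives_only.isend();    // throws: endpoints were not injected
  } catch (const std::logic_error&) {
  }
  return 0;
}
```

This keeps the interface uniform while making the collectives-only configuration explicit, matching the assert-on-call behavior suggested above.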
If there is no fundamental reason we cannot adopt the one-full-handle (handle with communicator) per process approach, we can design sub-communicator support assuming this. If I am not mistaken, this will not affect the current cuML pipelines, as cuML is not using sub-communicators. For cuGraph (which already adopts the one-full-handle-per-process approach), if we have an option to add sub-communicators to the full handle, we can work on 2D partitioning for the coming releases.
I don't know :-) This is mainly about simplicity. I'm not sure this really applies to our specific case, but in general, sharing increases software complexity, and I guess sharing endpoints between the full communicator and sub-communicators increases complexity. While there is no harm in adding P2P support for sub-communicators, there is no need, and if there is no need, I am inclined toward the simpler approach. I assume a sub-communicator supporting only collectives will be simpler than one supporting both collectives and P2P (with P2P on a sub-communicator, should we use the global rank or the rank within the sub-communicator? Is any synchronization necessary if multiple communicators use the same endpoints? We don't need to think about any of this if sub-communicators are collectives-only, since we can do all the necessary P2P communication using the global communicator).
This is mainly about simplicity. If we don't need P2P with sub-communicators, why should we support it? I will dig into the interface part, but if what you have suggested leads to a simpler design, we should go for that. I just don't want to add software development/maintenance burden for unnecessary features. cuGraph just needs a means to add sub-communicators for collectives and use them when necessary, under the one-full-handle-per-process approach.
Related to https://github.com/rapidsai/raft/pull/18/files#diff-0835dc79c2c9697632a30a73323b374fR218
Yeah... I think we can go with 1) the current implementation, 2) skipping P2P for sub-communicators, or 3) adding a flag to include/exclude P2P support. All three sound like legitimate choices at this moment, and we can add something like 3) later if needed. Any concerns or any other issues to be discussed? cuGraph needs this to implement 2D partitioning of the graph adjacency matrix; let me know if there is anything else I can do to move this PR forward.
Conflicts:
	CHANGELOG.md
	cpp/include/raft/comms/std_comms.hpp
@seunghwak, this should be ready for another (hopefully final) review. I've added a new argument. I like your idea regarding the addition of a hashmap to the handle to store instances of sub-communicators. What do you think of doing that in a follow-on PR, since this PR has already grown quite large?
Sure, we can do this in an additional PR (and I think that's actually better, as it is beyond the commSplit implementation).
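For reference, a toy sketch of what that follow-on hashmap could look like; every name here is hypothetical rather than raft's actual handle API. The handle would own the global communicator plus a map of named sub-communicators created once at initialization, e.g. cuGraph's row/column sub-communicators for 2D partitioning.

```cpp
// Toy sketch of the hashmap idea deferred to a follow-on PR; all names are
// hypothetical, not raft's actual handle_t API.
#include <cstdio>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>

struct comms_t {  // stand-in for the communicator wrapper
  int size;
  int rank;
};

class handle_t {
 public:
  void set_subcomm(std::string key, std::shared_ptr<comms_t> subcomm) {
    subcomms_[std::move(key)] = std::move(subcomm);
  }
  const comms_t& get_subcomm(const std::string& key) const {
    auto it = subcomms_.find(key);
    if (it == subcomms_.end()) throw std::out_of_range("no such subcomm: " + key);
    return *it->second;
  }

 private:
  std::unordered_map<std::string, std::shared_ptr<comms_t>> subcomms_;
};

int main() {
  handle_t handle;
  // Under the one-full-handle-per-process model, cuGraph would register its
  // row/column sub-communicators once, at initialization.
  handle.set_subcomm("row", std::make_shared<comms_t>(comms_t{2, 0}));
  handle.set_subcomm("col", std::make_shared<comms_t>(comms_t{2, 1}));
  const comms_t& row = handle.get_subcomm("row");
  std::printf("row subcomm: size=%d rank=%d\n", row.size, row.rank);
  return 0;
}
```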
Thank you very much for taking care of this issue, this will be really helpful!!!
If @afender is happy with these changes, then I think they are good to go. I'll try and create a PR for the …
A simple commSplit implementation with an associated end-to-end pytest that creates sub-communicators and performs separate allreduces on them.
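For readers unfamiliar with split semantics, below is a small self-contained model of what commSplit does, mirroring MPI_Comm_split: ranks with the same color land in the same sub-communicator, and key orders the ranks within it. This models only the rank bookkeeping; the actual PR implements the split over NCCL/UCX in std_comms.hpp.

```cpp
// Self-contained model of commSplit's color/key semantics (as in
// MPI_Comm_split); bookkeeping only, no actual NCCL/UCX communication.
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// For each global rank, returns (color = subcomm id, rank within subcomm).
std::vector<std::pair<int, int>> comm_split_model(const std::vector<int>& color,
                                                  const std::vector<int>& key) {
  const int n = static_cast<int>(color.size());
  std::vector<int> order(n);
  for (int r = 0; r < n; ++r) order[r] = r;
  // group by color; within a color, order by key (ties broken by old rank)
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    if (color[a] != color[b]) return color[a] < color[b];
    if (key[a] != key[b]) return key[a] < key[b];
    return a < b;
  });
  std::vector<std::pair<int, int>> out(n);
  int new_rank = 0;
  for (int i = 0; i < n; ++i) {
    if (i > 0 && color[order[i]] != color[order[i - 1]]) new_rank = 0;
    out[order[i]] = {color[order[i]], new_rank++};
  }
  return out;
}

int main() {
  // 4 ranks split into two sub-communicators: even ranks vs odd ranks
  std::vector<int> color{0, 1, 0, 1}, key{0, 0, 1, 1};
  auto assign = comm_split_model(color, key);
  for (int r = 0; r < 4; ++r)
    std::printf("rank %d -> subcomm %d, new rank %d\n",
                r, assign[r].first, assign[r].second);
  return 0;
}
```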