
[REVIEW] commSplit Implementation #18

Merged Jul 29, 2020 (10 commits)

Conversation

@cjnolet (Member) commented Jun 9, 2020

A simple commSplit implementation with associated end-to-end pytest that creates subcommunicators and performs separate allreduces on the subcommunicators.
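
For readers less familiar with the split semantics: color/key here follow the MPI_Comm_split convention, where ranks passing the same color land in the same sub-communicator and key orders the ranks within it. Below is a minimal MPI sketch of the pattern the end-to-end test exercises (split, then an independent allreduce per sub-group); this is the MPI analogue of the behavior, not the RAFT API itself.

```cpp
// Illustrative MPI analogue of the color/key split semantics (not the RAFT API):
// ranks sharing a color form one sub-communicator; key orders ranks within it.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int color = rank % 2;  // two sub-groups: even ranks and odd ranks
  int key   = rank;      // keep the original ordering within each sub-group
  MPI_Comm subcomm;
  MPI_Comm_split(MPI_COMM_WORLD, color, key, &subcomm);

  // Each sub-communicator performs its own, independent allreduce.
  int local = rank, sum = 0;
  MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, subcomm);
  std::printf("rank %d (color %d): sub-group sum = %d\n", rank, color, sum);

  MPI_Comm_free(&subcomm);
  MPI_Finalize();
  return 0;
}
```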

@cjnolet changed the title from [WIP] commSplit Implementation to [REVIEW] commSplit Implementation on Jun 10, 2020
(Resolved review threads on cpp/include/raft/comms/std_comms.hpp and cpp/include/raft/comms/ucp_helper.hpp)
Inline review comment (Member) on this snippet:

```cpp
}

// non-root ranks of new comm recv unique id
irecv(&id.internal, 128, root, color, requests.data() + request_idx);
```

Another hard-coded 128 here
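
(Context: 128 here presumably matches the size of the NCCL unique id; NCCL defines NCCL_UNIQUE_ID_BYTES as 128 and ncclUniqueId holds a char internal[NCCL_UNIQUE_ID_BYTES] buffer. Assuming id is a ncclUniqueId, a sizeof-based form would avoid the magic number:)

```cpp
// Hedged sketch: assuming `id` is a ncclUniqueId, whose `internal` member is a
// char array of NCCL_UNIQUE_ID_BYTES (== 128), sizeof removes the hard-coded size.
irecv(&id.internal, sizeof(id.internal), root, color, requests.data() + request_idx);
```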

@cjnolet (Member, Author) commented Jul 7, 2020

@BradReesWork this PR led to some good discussions on Slack, but until we reach an agreement on the path forward, it is blocked.

@seunghwak (Contributor) commented:

We should start more in-depth discussions this coming Monday, but I think we can already start thinking about handle use cases (in particular, with sub-communicators).

There are a few questions we can try to answer first.

We can mainly think of two different cases:

  1. There can be at most only one concurrently running primitive (so we need only one handle per OS process).
  2. There are multiple concurrently running primitives (multiple handles per process are necessary).
  • primitive: any C++ function taking a handle as an input argument.

If we're running MG primitives or large SG primitives (large enough to use up all the single-GPU resources), I think we just need to consider case 1.

If we're running multiple small SG primitives (too small for the capacity of a single GPU, so we need to run several primitives concurrently to fully use the GPU), we need multiple handles, but in that case there is no need for communicators. (There can be pathological cases, e.g. two primitives, one requiring very large memory but very little compute and one requiring very little memory but very large compute; here we could get some speedup by running two MG primitives concurrently. But let's ignore such seemingly very rare cases.)

So, if we can assume there is only one handle with communicator(s) per process (though there can be multiple handles without any communicator), I think we can simplify the discussion: if there is only one such handle, cuGraph/cuML can create the necessary (sub-)communicators on initialization, and we don't need to worry much about creating an excessive number of sub-communicators (cuGraph will create at most two).

Another side question: if we agree on one communicator handle per process, will there be one communicator handle for all of RAPIDS, or one for cuML and another for cuGraph (so effectively two communicator handles per process)? I remember cuML is planning to create a child class of raft::handle_t, so we may not be able to share one communicator handle across RAPIDS.

Another side question: should we limit sub-communicators to collectives only? For P2P, a single global communicator should be sufficient. Under our current implementation, this means a sub-communicator is just a wrapper around a NCCL sub-communicator (this may change if we adopt something like UCC, https://www.ucfconsortium.org/projects/ucc/, for host collectives; UCC provides collectives on top of UCX).
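
(To make the "sub-communicator == wrapper around a NCCL sub-communicator" point concrete, here is a rough, hedged sketch of how a collectives-only sub-communicator can be bootstrapped with NCCL alone; the function name and the way the unique id is shipped between ranks are illustrative, not the std_comms implementation.)

```cpp
// Rough sketch with illustrative names (not the std_comms implementation):
// bootstrapping a collectives-only NCCL sub-communicator. The root rank of each
// color generates a ncclUniqueId, the parent communicator ships it to the other
// members of that color, and every member then calls ncclCommInitRank with its
// rank and size *within* the sub-group.
#include <nccl.h>

ncclComm_t make_nccl_subcomm(int subcomm_rank, int subcomm_size, bool is_root,
                             ncclUniqueId& id) {
  if (is_root) {
    ncclGetUniqueId(&id);
    // ...send id.internal to the non-root ranks of this color via the parent
    //    communicator (e.g. point-to-point isend/irecv on the global comm)...
  } else {
    // ...receive id.internal from the root of this color...
  }
  ncclComm_t subcomm;
  ncclCommInitRank(&subcomm, subcomm_size, id, subcomm_rank);
  return subcomm;
}
```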

I will post more this coming Monday.

@cjnolet (Member, Author) commented Jul 25, 2020

Regarding one handle per process: unfortunately, we currently make no such constraint in cuML, in either SG or MG. We currently create a new handle for each estimator, even in MG. It would definitely be nice to see more reuse of the handle across different estimator instances, though.

Regarding NCCL-only subcommunicator: P2P is optional on the communicator and when P2P is enabled, the UCX endpoints are created once in UCX-py and reused. Do we gain anything by limiting the sub-communicator to NCCL-only if the global communicator was already initialized with p2p? The endpoints are just pointers that get pushed down to the new communicator.

If the concern here is related to performance, for example, from creating the array with the subset of UCX endpoints, another option might be to provide a setting on communicator creation that enables or disables the UCX endpoints from being injected into the subcommunicator (and performs an assert when they are called). This seems like the most straightforward solution to me: it maintains the same interface and doesn't increase the confusion factor from unexpected behavior (compared to MPI). What do you think?

@seunghwak (Contributor) commented:

> Regarding one handle per process: unfortunately, we currently make no such constraint in cuML, in either SG or MG. We currently create a new handle for each estimator, even in MG. It would definitely be nice to see more reuse of the handle across different estimator instances, though.

If there are no fundamental reasons we cannot adopt the one-full-handle (handle with communicator) per process approach, we can design sub-communicator support assuming this. If I am not mistaken, this will not affect the current cuML pipelines, as cuML is not using sub-communicators. For cuGraph (which already adopts the one-full-handle-per-process approach), if we have an option to add sub-communicators to the full handle, we can work on 2D partitioning for the coming releases.

> Regarding NCCL-only subcommunicator: P2P is optional on the communicator and when P2P is enabled, the UCX endpoints are created once in UCX-py and reused. Do we gain anything by limiting the sub-communicator to NCCL-only if the global communicator was already initialized with p2p? The endpoints are just pointers that get pushed down to the new communicator.

I don't know :-) This is mainly about simplicity. Not sure this really applies to our specific case, but in general, sharing increases software complexity, and I guess sharing endpoints between the full communicator and sub-communicators increases complexity. While there is no harm in adding P2P support for sub-communicators, there is also no need, and if there is no need, I am inclined toward the simpler approach. I assume a sub-communicator supporting only collectives will be simpler than one supporting both collectives and P2P communications (with P2P on a sub-communicator, should we use the global rank or the rank within the sub-communicator? Is any synchronization necessary if multiple communicators use the same endpoints? We don't need to think about this if sub-communicators are collectives-only, since we can do all the necessary P2P communication with the global communicator).

> If the concern here is related to performance, for example, from creating the array with the subset of UCX endpoints, another option might be to provide a setting on communicator creation that enables or disables the UCX endpoints from being injected into the subcommunicator (and performs an assert when they are called). This seems like the most straightforward solution to me: it maintains the same interface and doesn't increase the confusion factor from unexpected behavior (compared to MPI). What do you think?

This is mainly about simplicity. If we don't need P2P with sub-communicators, why should we support it? I will dig into the interface part, but if what you have suggested leads to a simpler design, we should go for that. I just don't want to add software development/maintenance burden for unnecessary features. cuGraph just needs a means to add sub-communicators for collectives and use them when necessary, with the one-full-handle-per-process approach.

@seunghwak (Contributor) commented:

Related to https://github.com/rapidsai/raft/pull/18/files#diff-0835dc79c2c9697632a30a73323b374fR218:

```cpp
std::unique_ptr<comms_iface> comm_split(int color, int key) const
```

Yeah... I think we can go with 1) the current implementation, 2) skipping P2P for sub-communicators, or 3) adding a flag to include/exclude P2P support. All three sound like legitimate choices at this moment.

And we can add something like set_subcomms() and get_subcomms() to raft::handle_t: set_subcomms() takes a string key and the comm_split output, and get_subcomms() takes a string key to find the matching sub-communicator. cuGraph can use predefined keys (e.g. cugraph-row and cugraph-column). Later, if cuGraph & cuML use the same communicator, we can coordinate on common keys to maximize reuse. A possible shape for this is sketched below.
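
(A hedged sketch of that handle extension: the names follow the proposal above, comms_t is a stand-in for the RAFT communicator wrapper, and this is not an existing raft::handle_t API.)

```cpp
// Hypothetical sketch of the proposed raft::handle_t extension (not an existing
// API): a string-keyed registry of sub-communicators, so cuGraph can register
// e.g. "cugraph-row" / "cugraph-column" once and look them up later.
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct comms_t { /* stand-in for the RAFT communicator wrapper */ };

class handle_t {  // illustrative stand-in for raft::handle_t
 public:
  void set_subcomms(const std::string& key, std::shared_ptr<comms_t> subcomm) {
    subcomms_[key] = std::move(subcomm);
  }
  const comms_t& get_subcomms(const std::string& key) const {
    return *subcomms_.at(key);  // throws std::out_of_range if the key was never registered
  }

 private:
  std::unordered_map<std::string, std::shared_ptr<comms_t>> subcomms_;
};
```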

Any concerns or other issues to be discussed? cuGraph needs this to implement 2D partitioning of the graph adjacency matrix, so let me know if there is anything else I can do to move this PR forward.

cjnolet added 2 commits on July 28, 2020, resolving merge conflicts in CHANGELOG.md and cpp/include/raft/comms/std_comms.hpp.
@cjnolet (Member, Author) commented Jul 28, 2020

@seunghwak, this should be ready for another (hopefully final) review. I've added a new argument, subcomms_ucp_, to the std_comms constructor that accepts UCP endpoints; it enables or disables the propagation of those endpoints into sub-communicators. When this option is set to false, comm_split will only create a NCCL-based sub-communicator.
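
(A hedged sketch of the behavior described above, with illustrative names rather than the actual std_comms code: the flag decides whether UCP endpoints are handed down to a sub-communicator; without them, collectives still work while P2P entry points assert.)

```cpp
// Hedged sketch with illustrative names (not the actual std_comms code): a flag
// decides whether UCP endpoints are handed down to a sub-communicator. Without
// them, collectives (NCCL) still work, but P2P entry points assert.
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct ucp_ep_stub {};  // stand-in for a UCP endpoint handle

class subcomm_stub {
 public:
  subcomm_stub(std::vector<ucp_ep_stub*> eps, bool ucp_enabled)
    : eps_(std::move(eps)), ucp_enabled_(ucp_enabled) {}

  void allreduce(/* buffers, op, stream, ... */) {
    // NCCL collective: always available on the sub-communicator.
  }

  void isend(const void* buf, std::size_t size, int dest) {
    assert(ucp_enabled_ && "P2P requested on a collectives-only sub-communicator");
    (void)buf; (void)size; (void)dest;
    // ...post the send on the UCP endpoint eps_[dest]...
  }

 private:
  std::vector<ucp_ep_stub*> eps_;
  bool ucp_enabled_;
};
```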

I like your idea regarding the addition of a hashmap to the handle to store instances of subcommunicators. What do you think of doing that in a follow-on PR since this PR has already grown quite large?

@seunghwak (Contributor) commented:

> I like your idea regarding the addition of a hashmap to the handle to store instances of subcommunicators. What do you think of doing that in a follow-on PR since this PR has already grown quite large?

Sure, we can do this in an additional PR (and I think that's actually better, as it is beyond the commSplit implementation).

@seunghwak (Contributor) left a review comment:

Thank you very much for taking care of this issue, this will be really helpful!!!

@cjnolet (Member, Author) commented Jul 29, 2020

If @afender is happy with these changes then I think they are good to go. I'll try and create a PR for the set_subcomm() tomorrow.
