[FEA] sub-communicator initialization for 2D partitioning support #1065

Closed
4 tasks done
seunghwak opened this issue Aug 12, 2020 · 5 comments · Fixed by #1196

@seunghwak
Contributor

seunghwak commented Aug 12, 2020

Is your feature request related to a problem? Please describe.
In 2D partitioning, we arrange P workers (i.e., GPUs) in a P_row x P_column grid (P = P_row * P_column) and also partition the graph adjacency matrix in 2D.
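
To make the grid concrete, here is a minimal sketch of mapping a worker rank to its (row, column) position. It assumes a row-major layout, which this issue does not specify, and `grid_coords` is a hypothetical helper name, not a cuGraph or RAFT function.

```python
def grid_coords(rank, p_row, p_col):
    """Return (row, col) of `rank` in a P_row x P_col grid.

    Assumption: ranks are laid out row-major (rank = row * p_col + col).
    """
    if not 0 <= rank < p_row * p_col:
        raise ValueError("rank must be in [0, p_row * p_col)")
    # divmod gives (rank // p_col, rank % p_col), i.e. (row, col)
    return divmod(rank, p_col)


# With P = 6 arranged as 2 x 3, rank 4 sits in row 1, column 1.
print(grid_coords(4, 2, 3))  # (1, 1)
```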

Common communication patterns are collectives among the workers in the same row or column. To support this, we need to add sub-communicators for the workers in the same row and the same column.

rapidsai/raft#18 and rapidsai/raft#44 added sub-communicator support to RAFT comms.

cuGraph currently initializes only the global communicator. We need to initialize sub-communicators as well.

https://github.com/rapidsai/raft/blob/f93dad05574b84d32ebbbd25681d2f9bcd7c0a14/cpp/include/raft/comms/comms.hpp#L95
comm_split (similar to MPI_Comm_split https://www.mpich.org/static/docs/latest/www3/MPI_Comm_split.html) splits the global communicator into sub-communicators.
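
As an illustration of the MPI_Comm_split semantics referenced above, the following pure-Python model groups ranks by a color value and orders each group by a key value. It is only a sketch of the color/key idea; `model_comm_split` is a hypothetical name and no RAFT or MPI call is made. Using the row index as color and the column index as key (and vice versa) is one plausible choice, assumed here for a row-major 2 x 3 grid.

```python
from collections import defaultdict


def model_comm_split(colors, keys):
    """Model of MPI_Comm_split: colors[r] and keys[r] are what old rank r passes.

    Returns {color: [old ranks]}, each list ordered by (key, old rank),
    which is the rank order inside the new communicator.
    """
    groups = defaultdict(list)
    for rank, color in enumerate(colors):
        groups[color].append(rank)
    return {c: sorted(m, key=lambda r: (keys[r], r)) for c, m in groups.items()}


p_row, p_col = 2, 3
rows = [r // p_col for r in range(p_row * p_col)]
cols = [r % p_col for r in range(p_row * p_col)]

# Row sub-communicators: same row -> same color.
print(model_comm_split(rows, cols))  # {0: [0, 1, 2], 1: [3, 4, 5]}
# Column sub-communicators: same column -> same color.
print(model_comm_split(cols, rows))  # {0: [0, 3], 1: [1, 4], 2: [2, 5]}
```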

https://github.com/rapidsai/raft/blob/f93dad05574b84d32ebbbd25681d2f9bcd7c0a14/cpp/include/raft/handle.hpp#L163
https://github.com/rapidsai/raft/blob/f93dad05574b84d32ebbbd25681d2f9bcd7c0a14/cpp/include/raft/handle.hpp#L167
set_subcomm and get_subcomm can be used to add sub-communicators to the handle and to retrieve an added sub-communicator when necessary.

  • Decide P_row & P_column for the given P. We may set P_row and P_column as close as possible to the square root of P by default with a user option to override the default value.
  • Call comm_split to create sub-communicators (two sub-communicators per worker, one for collectives among the workers in the same row, and the other for collectives among the workers in the same column)
  • Decide the string keys for the row sub-communicator and column sub-communicator.
  • Using the string keys, register the sub-communicators to the RAFT handle (call set_subcomm)
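
The first task above could be sketched as follows. This is one possible default, not necessarily what cuGraph ended up implementing: pick the largest divisor of P that does not exceed sqrt(P) as P_row, and let the caller override it. `choose_grid` and its `p_row` parameter are hypothetical names.

```python
import math


def choose_grid(p, p_row=None):
    """Return (p_row, p_col) with p_row * p_col == p.

    Default: p_row is the largest divisor of p that is <= sqrt(p), so the
    grid is as close to square as p allows. A user-supplied p_row overrides
    the default but must divide p.
    """
    if p < 1:
        raise ValueError("p must be positive")
    if p_row is None:
        p_row = math.isqrt(p)
        while p % p_row != 0:
            p_row -= 1
    elif p_row < 1 or p % p_row != 0:
        raise ValueError("p_row must be a positive divisor of p")
    return p_row, p // p_row


print(choose_grid(16))           # (4, 4)
print(choose_grid(8))            # (2, 4)
print(choose_grid(7))            # (1, 7)  a prime P degenerates to one row
print(choose_grid(8, p_row=4))   # (4, 2)  user override
```

Note that a prime P leaves no choice but a 1 x P grid under this scheme, which is one reason a user override is useful.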
@seunghwak seunghwak added the ? - Needs Triage Need team to review and classify label Aug 12, 2020
@BradReesWork BradReesWork added this to the 0.16 milestone Aug 20, 2020
@BradReesWork BradReesWork added feature request New feature or request and removed ? - Needs Triage Need team to review and classify labels Aug 20, 2020
@aschaffer aschaffer self-assigned this Aug 25, 2020
@aschaffer
Collaborator

"Call comm_split to create sub-communicators (two sub-communicators per worker, one for collectives among the workers in the same row, and the other for collectives among the workers in the same column)"

2 sub-comms / worker means a total of 2*P (= 2*P_row*P_col) sub-comms;

On the other hand,

"Decide the string keys for the row sub-communicator and column sub-communicator."
Suggests one sub-comm per row and one per column; i.e., a total of (P_row + P_col) sub-comms.

Could we clarify what is intended here?

@seunghwak
Contributor Author

"Call comm_split to create sub-communicators (two sub-communicators per worker, one for collectives among the workers in the same row, and the other for collectives among the workers in the same column)"

2 sub-comms / worker means a total of 2*P (= 2*P_row*P_col) sub-comms;

On the other hand,

"Decide the string keys for the row sub-communicator and column sub-communicator."
Suggests one sub-comm per row and one per column; i.e., a total of (P_row + P_col) sub-comms.

Could we clarify what is intended here?

Oh, yes, you're correct. It's a total of 2P sub-communicators. Each process has one global communicator and two sub-communicators (one for row-wise collectives and one for column-wise collectives).
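
The two counts discussed above can be reconciled with a small sketch: every process holds two sub-communicator objects (2P in total), while the number of distinct process groups behind them is P_row + P_col. The code below assumes a row-major grid, and `count_subcomms` is a hypothetical helper used only for this illustration.

```python
def count_subcomms(p_row, p_col):
    """Return (per-process sub-communicator objects, distinct process groups)."""
    objects = 0
    distinct = set()
    for rank in range(p_row * p_col):
        row, col = divmod(rank, p_col)  # assumed row-major layout
        row_group = frozenset(row * p_col + c for c in range(p_col))
        col_group = frozenset(r * p_col + col for r in range(p_row))
        objects += 2  # each process holds one row and one column sub-communicator
        distinct.add(("row", row_group))
        distinct.add(("col", col_group))
    return objects, len(distinct)


# P = 6 as 2 x 3: 12 objects (2P), but only 2 + 3 = 5 distinct groups.
print(count_subcomms(2, 3))  # (12, 5)
```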

@seunghwak
Contributor Author

seunghwak commented Aug 27, 2020

And

https://github.com/rapidsai/raft/blob/f93dad05574b84d32ebbbd25681d2f9bcd7c0a14/cpp/include/raft/handle.hpp#L163
https://github.com/rapidsai/raft/blob/f93dad05574b84d32ebbbd25681d2f9bcd7c0a14/cpp/include/raft/handle.hpp#L167
set_subcomm and get_subcomm can be used to add sub-communicators to the handle and to retrieve an added sub-communicator when necessary.

So, set_subcomm and get_subcomm take string keys to retrieve the sub-communicator.

And see here.

https://github.com/rapidsai/cugraph/pull/1098/files#diff-75b979f478be47f71bd6933f474074cbR32

I temporarily set string keys for the row-wise and column-wise sub-communicators here, but I'm not sure this is the best location or naming (which is why it has a FIXME).

And also note that the string keys for row-wise and column-wise sub-communicators are shared across GPUs.
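
To illustrate what "shared across GPUs" means in practice, here is a toy model in which every process registers its own row and column groups under the same two key strings. `HandleModel` is a stand-in dictionary wrapper, not the RAFT handle, and the key strings `"row_comm"` and `"col_comm"` are hypothetical placeholders rather than the names used in the linked PR. Sub-communicators are modeled as lists of member ranks in an assumed row-major grid.

```python
ROW_KEY = "row_comm"  # hypothetical key name
COL_KEY = "col_comm"  # hypothetical key name


class HandleModel:
    """Toy stand-in for a per-process handle with string-keyed sub-communicators."""

    def __init__(self):
        self._subcomms = {}

    def set_subcomm(self, key, subcomm):
        self._subcomms[key] = subcomm

    def get_subcomm(self, key):
        return self._subcomms[key]


def init_handles(p_row, p_col):
    """Build one handle per rank; all ranks use the same two keys."""
    handles = []
    for rank in range(p_row * p_col):
        row, col = divmod(rank, p_col)  # assumed row-major layout
        handle = HandleModel()
        handle.set_subcomm(ROW_KEY, [row * p_col + c for c in range(p_col)])
        handle.set_subcomm(COL_KEY, [r * p_col + col for r in range(p_row)])
        handles.append(handle)
    return handles


handles = init_handles(2, 3)
# Same key on every process, different members depending on the process.
print(handles[0].get_subcomm(ROW_KEY))  # [0, 1, 2]
print(handles[4].get_subcomm(ROW_KEY))  # [3, 4, 5]
print(handles[4].get_subcomm(COL_KEY))  # [1, 4]
```

Because the keys are identical everywhere, code that runs on any GPU can look up "my row" or "my column" without knowing its own grid position.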

Did I answer your question?

@rlratzel
Contributor

rlratzel commented Oct 1, 2020

Confirmed with @seunghwak that this can be closed now that PR #1124 is merged, and it is successfully being used in PR #1163.

@rlratzel
Contributor

rlratzel commented Oct 7, 2020

New PR #1196 now also needs to be closed before this can be closed.
