[FEA] sub-communicator initialization for 2D partitioning support #1065
Comments
2 sub-comms / worker means a total of: 2P (= 2P_row*P_col) sub-comms; On the other hand,
Could we clarify what is intended here? |
Oh, yes, you're correct. It's a total of 2P sub-communicators. Each process has one global communicator and two sub-communicators (one for row-wise collectives and the second for column-wise collectives). |
And
So, set_subcomm and get_subcomm take string keys to retrieve the sub-communicator. See here: https://github.com/rapidsai/cugraph/pull/1098/files#diff-75b979f478be47f71bd6933f474074cbR32 I temporarily set string keys for the row-wise and column-wise sub-communicators there, but I'm not sure this is the best location or the best names (which is why it has a FIXME). Also note that the string keys for the row-wise and column-wise sub-communicators are shared across GPUs. Did I answer your question? |
Confirmed with @seunghwak that this can be closed now that PR #1124 is merged, and it is successfully being used in PR #1163. |
New PR #1196 now also needs to be closed before this can be closed. |
Is your feature request related to a problem? Please describe.
In 2D partitioning, we align P workers (i.e. GPUs) to P_row * P_column and also partition the graph adjacency matrix in 2D.
Common communication patterns are collectives among the workers in the same row or column. To support this, we need to add sub-communicators for the workers in the same row and the same column.
rapidsai/raft#18 and rapidsai/raft#44 added sub-communicator support to RAFT comms.
cuGraph currently initializes only the global communicator. We need to initialize sub-communicators as well.
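To make the row/column grouping concrete, here is a minimal Python sketch that only simulates which ranks would end up in which sub-communicator. It is not RAFT or cuGraph code: the row-major mapping from rank to grid position and the helper names (grid_coords, split, build_subcomm_groups) are assumptions for illustration. The split helper mimics the color/key semantics of MPI_Comm_split: ranks with the same color form one group, ordered by key.

```python
from collections import defaultdict


def grid_coords(rank, p_col):
    """Map a global rank to (row, col). Row-major placement is an assumption."""
    return rank // p_col, rank % p_col


def split(ranks, color_of, key_of):
    """Group ranks by color and order each group by key (MPI_Comm_split style)."""
    groups = defaultdict(list)
    for r in ranks:
        groups[color_of(r)].append(r)
    # sorted() is stable, so equal keys keep their original rank order.
    return {c: sorted(members, key=key_of) for c, members in groups.items()}


def build_subcomm_groups(p_row, p_col):
    """Return (row_groups, col_groups) for a p_row x p_col worker grid."""
    ranks = range(p_row * p_col)
    # Row-wise sub-communicator: color = row index, key = column index.
    row_groups = split(
        ranks,
        color_of=lambda r: grid_coords(r, p_col)[0],
        key_of=lambda r: grid_coords(r, p_col)[1],
    )
    # Column-wise sub-communicator: color = column index, key = row index.
    col_groups = split(
        ranks,
        color_of=lambda r: grid_coords(r, p_col)[1],
        key_of=lambda r: grid_coords(r, p_col)[0],
    )
    return row_groups, col_groups


row_groups, col_groups = build_subcomm_groups(p_row=2, p_col=3)
print(row_groups)  # ranks that share a row
print(col_groups)  # ranks that share a column
```

With P_row = 2 and P_col = 3 under this assumed layout, every worker belongs to exactly one row group and one column group, so each of the P = 6 workers holds two sub-communicators (2P in total), while the number of distinct groups is P_row + P_col = 5.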
https://github.com/rapidsai/raft/blob/f93dad05574b84d32ebbbd25681d2f9bcd7c0a14/cpp/include/raft/comms/comms.hpp#L95
comm_split (similar to MPI_Comm_split https://www.mpich.org/static/docs/latest/www3/MPI_Comm_split.html) splits the global communicator into sub-communicators.
https://github.com/rapidsai/raft/blob/f93dad05574b84d32ebbbd25681d2f9bcd7c0a14/cpp/include/raft/handle.hpp#L163
https://github.com/rapidsai/raft/blob/f93dad05574b84d32ebbbd25681d2f9bcd7c0a14/cpp/include/raft/handle.hpp#L167
set_subcomm and get_subcomm can be used to add sub-communicators to the handle and to retrieve an added sub-communicator when necessary.
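The store-and-retrieve pattern described above can be sketched with a toy Python stand-in. This is not the RAFT handle (which is C++); the ToyHandle class, the member lists used in place of real communicators, and the key names "row" and "col" are all hypothetical and chosen only for illustration. The one property taken from the discussion is that sub-communicators are registered and looked up by string key, and that every worker uses the same key names.

```python
class ToyHandle:
    """Toy stand-in for a per-worker handle; not the RAFT API."""

    def __init__(self, rank, global_members):
        self.rank = rank
        self.comm = global_members  # placeholder for the global communicator
        self._subcomms = {}         # string key -> sub-communicator placeholder

    def set_subcomm(self, key, subcomm):
        """Register a sub-communicator under a string key."""
        self._subcomms[key] = subcomm

    def get_subcomm(self, key):
        """Retrieve a previously registered sub-communicator."""
        if key not in self._subcomms:
            raise KeyError(f"no sub-communicator registered under {key!r}")
        return self._subcomms[key]


# Hypothetical key names; the same strings are used on every worker.
ROW_KEY = "row"
COL_KEY = "col"

P_ROW, P_COL = 2, 3
handles = []
for rank in range(P_ROW * P_COL):
    row, col = rank // P_COL, rank % P_COL  # assumed row-major layout
    h = ToyHandle(rank, list(range(P_ROW * P_COL)))
    # Each "sub-communicator" is just the list of member ranks here.
    h.set_subcomm(ROW_KEY, [row * P_COL + c for c in range(P_COL)])
    h.set_subcomm(COL_KEY, [r * P_COL + col for r in range(P_ROW)])
    handles.append(h)

print(handles[4].get_subcomm(ROW_KEY))
print(handles[4].get_subcomm(COL_KEY))
```

Because the keys are identical on every worker, code that runs a row-wise or column-wise collective can look up the right sub-communicator without knowing the worker's position in the grid.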