-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NCCL2 support #105
NCCL2 support #105
Conversation
chainermn/communicators/__init__.py
Outdated
=============== === === ======== ======================================= | ||
|
||
Args: | ||
communicator_name: The name of communicator (``naive``, ``flat``, | ||
``hierarchical``, ``two_dimensional``, or ``single_node``) | ||
``hierarchical``, ``two_dimensional``, ``nccl``, or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How should we treat hierarchical communicator?
It is no longer necessary if we have NCCL2...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we do not need hierarchical communicator if we use NCCL 2. But it is still needed because chainermn supports NCCL 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it.
intra_size, nccl_comm_id, intra_rank) | ||
return intra_mpi_comm, inter_mpi_comm, intra_nccl_comm | ||
intra_size, intra_nccl_comm_id, intra_rank) | ||
if nccl.get_version() >= 2000: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the user specified 'nccl' communicator but only have NCCL 1 ?
Does it detect it and raise and error appropriately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If 'nccl' is used with NCCL 1, an exception error occurs in the constructor in NcclCommunicator. So it is able to detect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK!
chainermn/communicators/__init__.py
Outdated
@@ -17,11 +17,13 @@ def create_communicator( | |||
two_dimensional OK Required Each node has multiple NICs or HCAs | |||
single_node OK Required Single node with multiple GPUs | |||
flat OK N/A | |||
nccl OK Required N/A |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is nccl's recommended use cases "N/A" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: nccl
is recommended when NCCL2 is available in the environment, but it's still experimental support.
chainermn/nccl/chainermn_nccl.h
Outdated
@@ -0,0 +1,98 @@ | |||
// This file is a stub header file of nccl for Read the Docs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this line mean? I think we don't compile NCCL-related code at ReadTheDocs, so I think this line is not true.
chainermn/communicators/__init__.py
Outdated
+---------------+---+---+--------+--------------------------------------+ | ||
|flat | |OK | |N/A | | ||
+---------------+---+---+--------+--------------------------------------+ | ||
|nccl | |OK |Required|``nccl`` is recommended when NCCL2 is | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding version specification? (e.g., Required (>= v2)
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I added it.
self.gpu_buffer_b.ptr(), n_elems_total, | ||
nccl.NCCL_FLOAT, nccl.NCCL_SUM, | ||
stream.ptr) | ||
stream.synchronize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this synchronization is not necessary, as everything (including NCCL communication and following array manipulation) is done in the null stream.
from chainermn import nccl | ||
|
||
|
||
class NcclCommunicator(_base.CommunicatorBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have chainermn.nccl.NcclCommunicator
. This adds new NcclCommunicator
(chainermn.communicators.NcclCommunicator
). I think it is quite confusing. Is it possible to change the name of either one?
LGTM!!!!!! |
No description provided.