NCCL2 support #105

Merged (18 commits) on Aug 28, 2017


Member

@shu65 shu65 commented Aug 10, 2017

No description provided.

@shu65 shu65 changed the title [WIP] NCCL2 support NCCL2 support Aug 15, 2017
@shu65 shu65 mentioned this pull request Aug 15, 2017
=============== === === ======== =======================================

Args:
communicator_name: The name of communicator (``naive``, ``flat``,
``hierarchical``, ``two_dimensional``, or ``single_node``)
``hierarchical``, ``two_dimensional``, ``nccl``, or
Member

How should we treat the hierarchical communicator?
It is no longer necessary if we have NCCL2...

Member Author

I think we do not need the hierarchical communicator if we use NCCL 2, but it is still needed because chainermn also supports NCCL 1.

Member

got it.

intra_size, nccl_comm_id, intra_rank)
return intra_mpi_comm, inter_mpi_comm, intra_nccl_comm
intra_size, intra_nccl_comm_id, intra_rank)
if nccl.get_version() >= 2000:
Member

What happens if the user specifies the 'nccl' communicator but only has NCCL 1?
Does it detect this and raise an error appropriately?

Member Author

@shu65 shu65 Aug 15, 2017

If 'nccl' is used with NCCL 1, an exception is raised in the constructor of NcclCommunicator, so the mismatch is detected.

Member

OK!
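
For illustration, the fail-fast behaviour discussed above could look roughly like the following sketch. The helper `require_nccl2` and the constant `NCCL2_MIN_VERSION_CODE` are hypothetical names, not part of chainermn; only the `2000` threshold comes from the `nccl.get_version() >= 2000` check in the diff. How the actual NcclCommunicator constructor performs its check is not shown in this thread.

```python
# Hypothetical helper (not chainermn code): fail fast when NCCL 2 is required.
# The threshold mirrors the `nccl.get_version() >= 2000` check in the diff.
NCCL2_MIN_VERSION_CODE = 2000


def require_nccl2(version_code):
    """Return version_code if it denotes NCCL 2 or later, else raise."""
    if version_code < NCCL2_MIN_VERSION_CODE:
        raise RuntimeError(
            "the 'nccl' communicator requires NCCL 2 or later "
            "(version code >= {}), but got {}".format(
                NCCL2_MIN_VERSION_CODE, version_code))
    return version_code
```

In real code the argument would presumably be the value returned by `nccl.get_version()`, checked once in the communicator's constructor so that the error surfaces before any communication is attempted.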

@iwiwi iwiwi added the feature label Aug 15, 2017
@iwiwi iwiwi self-requested a review August 16, 2017 05:16
@shu65 shu65 added this to the v1.0.0 milestone Aug 16, 2017
@@ -17,11 +17,13 @@ def create_communicator(
two_dimensional OK Required Each node has multiple NICs or HCAs
single_node OK Required Single node with multiple GPUs
flat OK N/A
nccl OK Required N/A
Member

Why is nccl's recommended use case "N/A"?

Member

Update: nccl is recommended when NCCL2 is available in the environment, but support for it is still experimental.
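
As a rough sketch of that recommendation, a caller could pick the communicator name from the NCCL version code. The function `choose_communicator_name` is a hypothetical illustration, not a chainermn API; the names `'nccl'` and `'hierarchical'` are taken from the docstring in the diff, and the `2000` threshold from the version check discussed earlier in this thread.

```python
def choose_communicator_name(nccl_version_code):
    """Hypothetical helper: choose a communicator name by NCCL version.

    Assumes version codes >= 2000 mean NCCL 2 (as in the diff's check).
    Falls back to 'hierarchical', which the thread says is kept for NCCL 1.
    """
    if nccl_version_code >= 2000:
        return 'nccl'
    return 'hierarchical'
```

Since the `nccl` communicator is described as experimental, a project might reasonably keep `hierarchical` as the default and opt in to `nccl` explicitly.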

@@ -0,0 +1,98 @@
// This file is a stub header file of nccl for Read the Docs.
Contributor

What does this line mean? I think we don't compile NCCL-related code on Read the Docs, so I think this line is inaccurate.

+---------------+---+---+--------+--------------------------------------+
|flat | |OK | |N/A |
+---------------+---+---+--------+--------------------------------------+
|nccl | |OK |Required|``nccl`` is recommended when NCCL2 is |
Contributor

How about adding a version specification? (e.g., Required (>= v2))

Member Author

OK. I added it.

self.gpu_buffer_b.ptr(), n_elems_total,
nccl.NCCL_FLOAT, nccl.NCCL_SUM,
stream.ptr)
stream.synchronize()
Contributor

I think this synchronization is not necessary, as everything (including NCCL communication and following array manipulation) is done in the null stream.

from chainermn import nccl


class NcclCommunicator(_base.CommunicatorBase):
Contributor

We already have chainermn.nccl.NcclCommunicator. This adds a new NcclCommunicator (chainermn.communicators.NcclCommunicator), which I think is quite confusing. Is it possible to rename either one?

Contributor

iwiwi commented Aug 28, 2017

LGTM!!!!!!

@keisukefukuda keisukefukuda merged commit a690761 into master Aug 28, 2017
@keisukefukuda keisukefukuda deleted the nccl2 branch August 28, 2017 07:03