[REVIEW] Make NCCL root initialization configurable. #120

Merged 16 commits on Jan 20, 2021


drobison00
Contributor

This PR updates the raft comms implementation to support configurable NCCL root placement, updates scheduler and worker routine logging so that it is visible and usable in a multi-node deployment, and adds error handling to the NCCL initialization process so that it fails early if errors are encountered.

These updates resolve multi-node multi-GPU (MNMG) failures in out-of-band cuML algorithms where the Dask client cannot communicate directly with workers.

Unit tests are implemented and passing.
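
For readers unfamiliar with the idea, the following is a minimal sketch of configurable root placement with early failure, not the code from this PR. It assumes the root can be placed on the client, the scheduler, or a worker; every name in it (`FakeCluster`, `init_nccl_root`, `make_unique_id`, `VALID_ROOT_LOCATIONS`) is hypothetical, and a random 16-byte value stands in for a real NCCL unique ID.

```python
import uuid

# Hypothetical names throughout; this is an illustration, not the raft API.
VALID_ROOT_LOCATIONS = ("client", "scheduler", "worker")


def make_unique_id():
    """Stand-in for generating an NCCL unique ID (here: 16 random bytes)."""
    return uuid.uuid4().bytes


class FakeCluster:
    """Toy cluster that records where a function was asked to run."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.calls = []

    def run_on(self, location, func):
        # A real deployment would ship `func` to the scheduler or a worker
        # process; here we only record the location and call it locally.
        self.calls.append(location)
        return func()


def init_nccl_root(cluster, root_location="scheduler"):
    """Create the unique ID at `root_location` and map it to every worker."""
    root_location = root_location.lower()
    if root_location not in VALID_ROOT_LOCATIONS:
        # Fail early, before any communicator setup is attempted.
        raise ValueError(
            f"Invalid root location {root_location!r}; "
            f"expected one of {VALID_ROOT_LOCATIONS}"
        )
    if not cluster.workers:
        raise RuntimeError("Cannot initialize NCCL without any workers")

    # "worker" is resolved to the first worker; other values are used as-is.
    target = cluster.workers[0] if root_location == "worker" else root_location
    uid = cluster.run_on(target, make_unique_id)

    # Every worker receives the same ID so they can join one communicator.
    return {worker: uid for worker in cluster.workers}


cluster = FakeCluster(["worker-0", "worker-1"])
ids = init_nccl_root(cluster, root_location="scheduler")
```

Placing the root on the scheduler or a worker means the client never has to reach the workers directly, which is the situation described above.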

Closes rapidsai/cuml#3261

@drobison00 requested a review from cjnolet January 13, 2021 20:53
@drobison00 added the enhancement and improvement labels Jan 13, 2021
@drobison00 requested a review from JohnZed January 13, 2021 21:21
@drobison00
Contributor Author

Rerun tests

@drobison00
Contributor Author

Tests currently failing in CI appear to be due to rapidsai/ucx-py#668

Member

@cjnolet left a comment

Looks good overall. Just a few minor things to consolidate some of the logic and ease the maintenance burden a little bit.

python/raft/dask/common/comms.py (4 threads, outdated, resolved)
python/raft/test/test_comms.py (1 thread, outdated, resolved)
@drobison00 requested a review from cjnolet January 14, 2021 23:25
@drobison00
Contributor Author

Rerun tests

@dantegd
Member

dantegd commented Jan 19, 2021

rerun tests

Member

@cjnolet left a comment

LGTM

@dantegd removed the 1 - On Deck label Jan 19, 2021
@dantegd
Member

dantegd commented Jan 19, 2021

@drobison00 this is a breaking change, right? Then cugraph will also need a similar PR to rapidsai/cuml#3386 to upgrade their tagged commit

cc @afender @BradReesWork for vis

@drobison00
Contributor Author

@dantegd I added a PR to reflect the changes: rapidsai/cugraph#1343

Once the raft update is merged, I can update git tags as well.

@dantegd added the breaking label Jan 20, 2021
@dantegd merged commit 87efed0 into rapidsai:branch-0.18 Jan 20, 2021
dantegd pushed a commit to dantegd/raft that referenced this pull request Jul 23, 2024
This PR enables host input arrays for `ivf_pq::build` and `ivf_pq::extend`.

closes rapidsai#120 
closes rapidsai#143

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai/cuvs#148
Labels
3 - Ready for Review, 5 - Ready to Merge, breaking, enhancement, improvement
Development

Successfully merging this pull request may close these issues.

[BUG] MNMG KMeans fit fails with NCCL errors in multi-worker cluster.
3 participants