[REVIEW] Make NCCL root initialization configurable. #120
Rerun tests
Tests currently failing in CI appear to be due to rapidsai/ucx-py#668
Looks good overall. Just a few minor things to consolidate some of the logic and ease the maintenance burden a little bit.
LGTM
@drobison00 this is a breaking change, right? If so, cugraph will also need a PR similar to rapidsai/cuml#3386 to upgrade their tagged commit. cc @afender @BradReesWork for visibility
@dantegd added a PR to reflect the changes: rapidsai/cugraph#1343. Once the raft update is merged, I can update the git tags as well.
This PR enables host input arrays for `ivf_pq::build` and `ivf_pq::extend`. closes rapidsai#120 closes rapidsai#143 Authors: - Tamas Bela Feher (https://github.com/tfeher) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: rapidsai/cuvs#148
This updates the raft comms implementation to support configurable NCCL root placement, updates scheduler and worker routine logging so that it is visible and usable in a multi-node deployment, and adds additional error handling to the NCCL initialization process to fail early if errors are encountered.
These updates resolve MNMG failures in out-of-band cuML algorithms where the Dask client cannot communicate directly with the workers.
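For reviewers who want a picture of the idea, here is a minimal sketch of configurable root placement with fail-early checks. It is not the code in this PR: `pick_root`, `create_unique_id`, `NcclInitError`, the placement names, and the `run_on` callback are all hypothetical, and the NCCL call itself is stubbed out so the example runs without a GPU.

```python
from typing import Callable, Sequence

# Hypothetical placement options; the real option names may differ.
VALID_PLACEMENTS = ("client", "scheduler", "worker")


class NcclInitError(RuntimeError):
    """Raised when root selection or unique-ID creation fails."""


def pick_root(placement: str, workers: Sequence[str]) -> str:
    """Return the process that should create the NCCL unique ID."""
    placement = placement.lower()
    if placement not in VALID_PLACEMENTS:
        raise NcclInitError(f"unknown root placement: {placement!r}")
    if placement == "worker":
        if not workers:
            raise NcclInitError("root placement is 'worker' but no workers exist")
        # Sort so every caller deterministically agrees on the same worker.
        return sorted(workers)[0]
    return placement


def create_unique_id(
    placement: str,
    workers: Sequence[str],
    run_on: Callable[[str], bytes],
) -> bytes:
    """Create the unique ID on the chosen root, failing early on any error.

    `run_on(root)` is a stand-in for whatever executes the ID creation on
    the selected process (assumption: it returns the ID as bytes).
    """
    root = pick_root(placement, workers)
    try:
        uid = run_on(root)
    except Exception as exc:
        raise NcclInitError(f"unique-ID creation failed on {root}") from exc
    if not uid:
        raise NcclInitError(f"{root} returned an empty unique ID")
    return uid
```

With the root placed on the scheduler or a worker, the client never has to reach a worker directly to bootstrap NCCL, which is the out-of-band case described above.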
Unit tests are implemented and passing.
Closes rapidsai/cuml#3261