[BUG]: MNMG tests failing on later version of NCCL (>2.11.4.1) #3478

Closed
2 tasks done
jnke2016 opened this issue Apr 11, 2023 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@jnke2016
Contributor

Version

23.04

Which installation method(s) does this occur on?

Docker, Conda, Source

Describe the bug.

All of our MNMG runs fail when running on 32+ GPUs with NCCL > 2.11.4.1. This impacts all the algorithms.
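As a rough aid for triage, the condition described above can be written as a small check. This is only a sketch: `parse_version`, `is_affected`, and `LAST_KNOWN_GOOD` are hypothetical names, not part of cuGraph, RAFT, or NCCL. It assumes the NCCL version is available as a plain dotted string such as "2.11.4.1", and that runs on fewer than 32 GPUs are unaffected, which the report does not state explicitly.

```python
def parse_version(text):
    """Turn a dotted version string such as "2.11.4.1" into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Last NCCL version reported as working in this issue.
LAST_KNOWN_GOOD = parse_version("2.11.4.1")

# Smallest GPU count at which failures were reported (assumption: smaller runs pass).
MIN_FAILING_GPUS = 32


def is_affected(nccl_version, gpu_count):
    """Return True when a configuration falls in the failing range reported here."""
    newer_nccl = parse_version(nccl_version) > LAST_KNOWN_GOOD
    return newer_nccl and gpu_count >= MIN_FAILING_GPUS
```

Versions are compared as integer tuples rather than strings so that, for example, 2.9 sorts below 2.11. Version strings with non-numeric suffixes would need extra handling.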

Minimum reproducible example

Run any MNMG algo on 32+ GPUs

Relevant log output

No response

Environment details

No response

Other/Misc.

No response

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@jnke2016 added the "? - Needs Triage" and "bug" labels on Apr 11, 2023
@rlratzel removed the "? - Needs Triage" label on Apr 11, 2023
@rlratzel
Contributor

Notes (20230411)

  • still need a minimal reproducer
  • will need a 32+ GPU cluster to reproduce
  • RAFT related

@rlratzel
Contributor

5 points remaining for 23.08
