
Fix multi-GPU hang on graph generation #1572

Merged 10 commits into rapidsai:branch-0.20 on May 6, 2021

Conversation

seunghwak (Contributor) commented:

Two bug fixes for multi-GPU graph creation:

  • Add a barrier to avoid overlap between different communicators (see the sketch after this list).
  • Work around an NCCL bug on DGX1.
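For illustration only — this is not the PR's actual diff — here is a minimal sketch of the hazard the first bullet addresses, assuming NCCL collectives driven from MPI ranks. The function and variable names (`exchange`, `row_comm`, `col_comm`) are hypothetical.

```cpp
// Hypothetical sketch, not cuGraph's code: two NCCL communicators that
// share the same GPUs (e.g., a row communicator and a column communicator
// in a 2D partitioning) must not have collectives in flight concurrently.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

void exchange(ncclComm_t row_comm, ncclComm_t col_comm,
              float* buf, size_t count, cudaStream_t stream) {
  // First collective, on the row communicator.
  ncclAllReduce(buf, buf, count, ncclFloat, ncclSum, row_comm, stream);

  // Without a barrier here, some ranks may enter the column-communicator
  // collective while others are still inside the row-communicator one;
  // because the two communicators share GPUs, this can deadlock (hang).
  cudaStreamSynchronize(stream);  // drain the row collective
  MPI_Barrier(MPI_COMM_WORLD);    // host-side rendezvous across all ranks

  // Now it is safe to start a collective on the second communicator.
  ncclAllReduce(buf, buf, count, ncclFloat, ncclSum, col_comm, stream);
}
```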

@seunghwak requested review from a team as code owners on May 3, 2021 18:30
@seunghwak added the 2 - In Progress, DO NOT MERGE (Hold off on merging; see PR for details), and bug (Something isn't working) labels on May 3, 2021
seunghwak (Contributor, Author) commented:

@rlratzel There is one more bug I am tracking, but you can run this on other machines to see how it behaves there (and whether the current unresolved issue is DGX1-specific or not).

codecov-commenter commented on May 3, 2021:

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.20@8b1004e).
The diff coverage is n/a.


@@              Coverage Diff               @@
##             branch-0.20    #1572   +/-   ##
==============================================
  Coverage               ?   59.98%           
==============================================
  Files                  ?       77           
  Lines                  ?     3369           
  Branches               ?        0           
==============================================
  Hits                   ?     2021           
  Misses                 ?     1348           
  Partials               ?        0           


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8b1004e...8f12235.

@seunghwak added the 3 - Ready for Review and non-breaking (Non-breaking change) labels and removed the 2 - In Progress and DO NOT MERGE labels on May 4, 2021
rlratzel (Contributor) left a comment:


Thanks! Nothing jumped out, so LGTM.

@BradReesWork added this to the 21.06 milestone on May 5, 2021
BradReesWork (Member) commented:

@gpucibot merge

@rapids-bot merged commit e4f58eb into rapidsai:branch-0.20 on May 6, 2021
@seunghwak deleted the bug_mg_test branch on June 24, 2021
rapids-bot pushed a commit referencing this pull request on Jun 29, 2022:
## Description

This PR cleans up some `#include`s for Thrust. This is meant to help ease the transition to Thrust 1.17 when that is updated in rapids-cmake.

## Context

I opened a PR rapidsai/cudf#10489 that updates cuDF to Thrust 1.16. Notably, version 1.16 of Thrust reduced the number of internal header inclusions:
> [#1572](NVIDIA/thrust#1572) Removed several unnecessary header includes. Downstream projects may need to update their includes if they were relying on this behavior.

I spoke with @robertmaynard and he recommended making similar changes to clean up includes ("include what we use," in essence) to make sure we have compatibility with future versions of Thrust across all RAPIDS libraries.

This changeset also makes it more obvious where cugraph depends on `thrust/detail` headers.
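As an aside (not part of the commit itself), the "include what you use" pattern the commit describes looks roughly like this. The example program below is hypothetical; it simply shows each Thrust algorithm and iterator getting its own explicit include rather than relying on transitive inclusions, which Thrust 1.16 no longer guarantees.

```cpp
// Illustrative sketch of explicit Thrust includes ("include what you use").
#include <thrust/device_vector.h>
#include <thrust/transform.h>                    // thrust::transform
#include <thrust/functional.h>                   // thrust::negate
#include <thrust/iterator/counting_iterator.h>   // thrust::counting_iterator

int main() {
  thrust::device_vector<int> out(4);
  // Before Thrust 1.16, transform.h was often pulled in transitively by
  // other Thrust headers; now it must be included directly.
  thrust::transform(thrust::counting_iterator<int>(0),
                    thrust::counting_iterator<int>(4),
                    out.begin(),
                    thrust::negate<int>());  // out = {0, -1, -2, -3}
  return 0;
}
```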

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Brad Rees (https://github.com/BradReesWork)
  - Seunghwa Kang (https://github.com/seunghwak)

URL: #2310
Labels: bug (Something isn't working), non-breaking (Non-breaking change)
5 participants