[BUG] Use the Correct WG Communicator #4548

alexbarghi-nv · 2024-07-22T16:22:58Z

cuGraph-PyG's WholeFeatureStore currently uses the local communicator when it should use the global communicator, as was originally intended. This PR modifies the feature store so that it correctly calls get_global_node_communicator().

This also fixes another bug where torch.int32 was used to store the number of edges in the graph, which caused an overflow error when the edge count exceeded the int32 maximum of 2,147,483,647 (2^31 - 1). The edge count is now stored as int64.
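To make the overflow condition concrete, here is a minimal standard-library sketch of the range check involved. It is not the cuGraph-PyG code: `fits_signed` is a hypothetical helper and the edge count is made up purely to exceed the int32 range.

```python
def fits_signed(value: int, bits: int) -> bool:
    """Return True if value is representable as a signed integer of the given width."""
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1


INT32_MAX = 2 ** 31 - 1  # 2,147,483,647

# Hypothetical edge count, chosen only because it exceeds the int32 range.
num_edges = 3_000_000_000

print(fits_signed(num_edges, 32))  # False: too large for a signed 32-bit integer
print(fits_signed(num_edges, 64))  # True: well within the signed 64-bit range
```

Any graph with more than about 2.1 billion edges falls into the first case, which is why the wider datatype is needed.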

rlratzel

LGTM (disclaimer: I'm not familiar with the behavior, side-effects, and intent, so feel free to also pull in others if necessary)

alexbarghi-nv · 2024-07-22T16:58:04Z

> LGTM (disclaimer: I'm not familiar with the behavior, side-effects, and intent, so feel free to also pull in others if necessary)

The tl;dr is that the local communicator is used for intra-node communication and the global communicator is used for inter-node communication. With only one node, they are effectively the same communicator. With multiple nodes, though, the local communicator won't include all workers, so we could potentially hang if we use it here. There usually isn't any reason to use the local communicator in this context unless we're doing something that we know only involves the workers on the current node.
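The difference in membership can be illustrated with a small pure-Python sketch. It does not use the WholeGraph API: `local_ranks` and `global_ranks` are hypothetical helpers, and the sketch assumes ranks are assigned contiguously with the same number of workers on every node.

```python
def local_ranks(rank: int, workers_per_node: int) -> list[int]:
    """Ranks that share a node with `rank` (assumes contiguous rank assignment)."""
    first = (rank // workers_per_node) * workers_per_node
    return list(range(first, first + workers_per_node))


def global_ranks(world_size: int) -> list[int]:
    """All ranks across every node."""
    return list(range(world_size))


# Single node with 4 workers: both groups contain the same ranks.
print(local_ranks(2, 4) == global_ranks(4))

# Two nodes with 4 workers each: rank 5's local group misses the first node.
missing = sorted(set(global_ranks(8)) - set(local_ranks(5, 4)))
print(missing)
```

In the two-node case, ranks 0 through 3 are absent from rank 5's local group, so an operation that needs every worker cannot complete over that group.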

alexbarghi-nv · 2024-07-30T14:20:37Z

/merge

[BUG] Use the Correct WG Communicator (rapidsai/cugraph#4548)

use global communicator

89f4ef4

alexbarghi-nv self-assigned this Jul 22, 2024

alexbarghi-nv added the bug Something isn't working label Jul 22, 2024

github-actions bot added the python label Jul 22, 2024

alexbarghi-nv added non-breaking Non-breaking change and removed python labels Jul 22, 2024

alexbarghi-nv added this to the 24.08 milestone Jul 22, 2024

alexbarghi-nv marked this pull request as ready for review July 22, 2024 16:23

alexbarghi-nv requested a review from a team as a code owner July 22, 2024 16:23

rlratzel approved these changes Jul 22, 2024

global

4d82ee0

github-actions bot added the python label Jul 22, 2024

alexbarghi-nv and others added 5 commits July 23, 2024 11:19

Merge branch 'branch-24.08' into use-correct-communicator

b4ed827

Merge branch 'branch-24.08' into use-correct-communicator

e144ad1

use int64 to store # edges

2b160bf

Merge branch 'use-correct-communicator' of https://github.com/alexbar…

22b85d2

…ghi-nv/cugraph into use-correct-communicator

Merge branch 'branch-24.08' into use-correct-communicator

a77499a

rapids-bot bot merged commit ac35be3 into rapidsai:branch-24.08 Jul 30, 2024
130 of 131 checks passed

alexbarghi-nv deleted the use-correct-communicator branch July 30, 2024 14:20

alexbarghi-nv mentioned this pull request Jul 30, 2024

[BUG] Use the Correct WG Communicator (rapidsai/cugraph#4548) rapidsai/cugraph-gnn#20

Merged

BradReesWork added a commit to rapidsai/cugraph-gnn that referenced this pull request Aug 1, 2024

Merge pull request #20 from alexbarghi-nv/correct-wg-comm

961fd04

[BUG] Use the Correct WG Communicator (rapidsai/cugraph#4548)
