[BUG] Use the Correct WG Communicator #4548

Merged


@alexbarghi-nv (Member) commented Jul 22, 2024

cuGraph-PyG's WholeFeatureStore currently uses the local communicator when it should use the global communicator, as was originally intended. This PR modifies the feature store so that it calls get_global_node_communicator() instead.

This PR also fixes a second bug: torch.int32 was used to store the number of edges in the graph, which caused an overflow error once the edge count exceeded the int32 maximum (2^31 - 1 = 2,147,483,647). The datatype is now set to int64.
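To make the edge-count limit concrete, here is a minimal sketch in plain Python. It uses the standard-library ctypes module as a stand-in for a 32-bit integer dtype; it is not the cuGraph-PyG code, and the helper names and the edge count are hypothetical. Note that ctypes wraps around silently, whereas the PR description reports an overflow error, so the failure mode differs from the one seen in practice.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2,147,483,647, the largest signed 32-bit value


def wrap_to_int32(value: int) -> int:
    """Return what a signed 32-bit integer holds after storing `value`.

    ctypes performs no overflow checking, so out-of-range values wrap.
    """
    return ctypes.c_int32(value).value


def wrap_to_int64(value: int) -> int:
    """Return what a signed 64-bit integer holds after storing `value`."""
    return ctypes.c_int64(value).value


# Hypothetical edge count, chosen only because it exceeds INT32_MAX.
num_edges = 3_000_000_000

print(num_edges > INT32_MAX)      # True: the count does not fit in 32 bits
print(wrap_to_int32(num_edges))   # wrapped value, no longer the true count
print(wrap_to_int64(num_edges))   # 3000000000: preserved in 64 bits
```

Any graph with more than 2,147,483,647 edges needs the wider type, which is why switching the count to int64 removes the problem rather than just postponing it for realistic graph sizes.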

@alexbarghi-nv self-assigned this Jul 22, 2024
@alexbarghi-nv added the bug label Jul 22, 2024
@alexbarghi-nv added non-breaking and removed python labels Jul 22, 2024
@alexbarghi-nv added this to the 24.08 milestone Jul 22, 2024
@alexbarghi-nv marked this pull request as ready for review July 22, 2024 16:23
@alexbarghi-nv requested a review from a team as a code owner July 22, 2024 16:23

@rlratzel (Contributor) left a comment

LGTM (disclaimer: I'm not familiar with the behavior, side-effects, and intent, so feel free to also pull in others if necessary)

@alexbarghi-nv (Member, Author)

> LGTM (disclaimer: I'm not familiar with the behavior, side-effects, and intent, so feel free to also pull in others if necessary)

The tl;dr is that the local communicator is used for intra-node communication and the global communicator is used for inter-node communication. If there's only one node, they are effectively the same communicator. But if there are multiple nodes, the local communicator won't include all workers, so we could potentially hang if we use it here. There usually isn't any reason to use the local communicator in this context unless we're doing something that we know only involves the workers on the current node.
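The difference in membership can be sketched with a toy model in plain Python. This is only an illustration of the explanation above, not the WholeGraph API: the function names are hypothetical, and the sketch assumes ranks are assigned to nodes in contiguous blocks, which the thread does not state.

```python
from typing import List


def global_group(num_nodes: int, workers_per_node: int) -> List[int]:
    """Ranks in the global communicator: every worker on every node."""
    return list(range(num_nodes * workers_per_node))


def local_group(rank: int, workers_per_node: int) -> List[int]:
    """Ranks in the local communicator: only workers on `rank`'s node.

    Assumes contiguous rank-to-node assignment (an assumption of this sketch).
    """
    node = rank // workers_per_node
    start = node * workers_per_node
    return list(range(start, start + workers_per_node))


def missing_from_local(rank: int, num_nodes: int, workers_per_node: int) -> List[int]:
    """Workers that the global group reaches but the local group does not."""
    everyone = set(global_group(num_nodes, workers_per_node))
    neighbors = set(local_group(rank, workers_per_node))
    return sorted(everyone - neighbors)


# One node with 4 workers: local and global membership coincide.
print(missing_from_local(rank=0, num_nodes=1, workers_per_node=4))  # []

# Two nodes with 4 workers each: rank 5 cannot reach ranks 0-3 locally.
print(missing_from_local(rank=5, num_nodes=2, workers_per_node=4))  # [0, 1, 2, 3]
```

In the two-node case, any operation that needs every worker to take part but runs over the local group leaves the other node's workers out, which is consistent with the hang described above.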

@alexbarghi-nv (Member, Author)

/merge

@rapids-bot merged commit ac35be3 into rapidsai:branch-24.08 Jul 30, 2024
130 of 131 checks passed
@alexbarghi-nv deleted the use-correct-communicator branch July 30, 2024 14:20
BradReesWork added a commit to rapidsai/cugraph-gnn that referenced this pull request Aug 1, 2024
Labels
bug, non-breaking, python
2 participants