[Bugfix] Add stream synchronization before the scatter operation #73

chang-l · 2024-11-19T00:09:59Z

This addresses the issue found in rapidsai/wholegraph#229. The change applies only to the last scatter operation before the Python interface, not to every internal scatter_func call.

Since the output of the scatter operation could be on the host (e.g., when emb_device = 'cpu'), it is necessary to perform synchronization internally. This ensures users do not need to explicitly synchronize the compute stream before accessing the host memory.

The gather operation is different: its output is always in device memory, so host-side synchronization is unnecessary there.
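To illustrate why the synchronization matters, here is a minimal CPU-only analogy in plain Python. It is not the WholeGraph implementation: `FakeStream` and this `scatter` function are hypothetical names, and a worker thread stands in for the compute stream. The point it sketches is that work queued on a stream may still be in flight when the call returns, so a host-side buffer is only safe to read after the stream has been synchronized.

```python
import queue
import threading


class FakeStream:
    """Hypothetical stand-in for a compute stream: runs queued tasks in order on a worker thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            task = self._tasks.get()
            try:
                task()
            finally:
                self._tasks.task_done()

    def enqueue(self, task):
        # Returns immediately; the task runs later on the worker thread.
        self._tasks.put(task)

    def synchronize(self):
        # Block the caller until every queued task has finished.
        self._tasks.join()


def scatter(values, indices, host_buffer, stream, sync=True):
    """Write values[k] into host_buffer[indices[k]] as work queued on the stream."""

    def work():
        for i, v in zip(indices, values):
            host_buffer[i] = v

    stream.enqueue(work)
    if sync:
        # Assumed analogue of the fix: synchronize inside the operation so the
        # caller can read host_buffer as soon as scatter() returns.
        stream.synchronize()


stream = FakeStream()
host_buffer = [0.0] * 8
scatter([1.5, 2.5, 3.5], [6, 0, 3], host_buffer, stream)
print(host_buffer)
```

With `sync=False`, the caller would have to call `stream.synchronize()` itself before reading `host_buffer`; otherwise the read races with the queued write.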

copy-pr-bot · 2024-11-19T00:10:02Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

linhu-nv

looks good to me.

alexbarghi-nv · 2024-11-19T21:07:21Z

/ok to test

@linhu-nv

Migrated from: rapidsai/wholegraph#229

This PR adds gather/scatter support for 1D tensors at the Python level, as WholeGraph should support basic indexing operations for both 1D (array) and 2D (matrix) wholememory tensors. Without this PR, the gather/scatter op does not work on a 1D wholememory tensor, e.g., https://github.com/rapidsai/wholegraph/blob/0efba33835d6e4e104b5d7101a91e0ea55a6ca53/python/pylibwholegraph/pylibwholegraph/torch/tensor.py#L89

To test, run

```
pytest --cache-clear --import-mode=append tests/wholegraph_torch/ops/test_wholegraph_gather_scatter.py -s
```

**Remaining issue:** In my local test with a single GPU, the test passes. In a multi-GPU setup, the gather op works fine, but 1D scatter does not seem to work: it crashes at https://github.com/rapidsai/wholegraph/blob/2e963b98aa6027c300d60e839010d3dd8ca422eb/python/pylibwholegraph/pylibwholegraph/tests/wholegraph_torch/ops/test_wholegraph_gather_scatter.py#L108 with incorrect scatter outputs:

`Indices where allclose fails: tensor([0., 0., 0., ..., 0., 0., 0.]) tensor([ 1435., 1439., 1443., ..., 257703., 257707., 257711.])`

This works once this bugfix is merged: #73

cc. @linhu-nv

Authors:
- Chang Liu (https://github.com/chang-l)

Approvers:
- https://github.com/linhu-nv
- Alex Barghi (https://github.com/alexbarghi-nv)

URL: #74
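For readers unfamiliar with the terms, the sketch below shows the general gather/scatter indexing semantics on both a 1D array and a 2D matrix, using plain Python lists. The function names `gather_rows` and `scatter_rows` are hypothetical and chosen for illustration; this is not the WholeGraph API and says nothing about how the library implements these ops across GPUs.

```python
def gather_rows(tensor, indices):
    """Return the entries (scalars for 1D, rows for 2D) of tensor at the given indices."""
    return [tensor[i] for i in indices]


def scatter_rows(tensor, indices, src):
    """Write src[k] into tensor[indices[k]] in place; works for 1D and 2D lists."""
    if len(indices) != len(src):
        raise ValueError("indices and src must have the same length")
    for i, v in zip(indices, src):
        # Copy rows so the tensor does not alias the source data.
        tensor[i] = list(v) if isinstance(v, list) else v


# 1D (array) case
vec = [10, 20, 30, 40]
print(gather_rows(vec, [3, 1]))
scatter_rows(vec, [0, 2], [-1, -2])
print(vec)

# 2D (matrix) case: each index selects a whole row
mat = [[1, 2], [3, 4], [5, 6]]
print(gather_rows(mat, [2, 0]))
scatter_rows(mat, [1], [[9, 9]])
print(mat)
```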

alexbarghi-nv · 2024-11-22T16:40:27Z

/ok to test

alexbarghi-nv · 2024-12-02T17:13:11Z

/merge

chang-l requested a review from a team as a code owner November 19, 2024 00:10

chang-l changed the title ~~[Bugfix] Add stream synchronization before the scatter operation (only for the last scatter operation before the Python interface).~~ [Bugfix] Add stream synchronization before the scatter operation Nov 19, 2024

Sync stream for scatter_op

8ba6cdf

chang-l force-pushed the sync-scatter-stream branch from 097434c to 8ba6cdf Compare November 19, 2024 00:13

chang-l mentioned this pull request Nov 19, 2024

[Feature] Add gather/scatter support 1D tensor #74

Merged

linhu-nv approved these changes Nov 19, 2024

linhu-nv added the bug and non-breaking labels Nov 19, 2024

Merge branch 'branch-24.12' into sync-scatter-stream

d017e22

rapids-bot bot merged commit 466b5b9 into rapidsai:branch-24.12 Dec 2, 2024
79 of 82 checks passed
