[Bugfix] Add stream synchronization before the scatter operation (#73)
This addresses the issue reported in rapidsai/wholegraph#229. The change applies only to the final scatter operation before the Python interface, not to every internal `scatter_func` call.

Because the output of the scatter operation can reside in host memory (e.g., when emb_device = 'cpu'), the synchronization has to happen internally. This way, users do not need to explicitly synchronize the compute stream before accessing the host memory.

The gather operation is different: its output is always in device memory, so host-side synchronization is unnecessary there.
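
The hazard this commit addresses can be illustrated with a minimal, self-contained CUDA sketch. It is not the wholememory implementation: `toy_scatter_kernel`, `scatter_to_host_and_sync`, and the `DEMO_CUDA_TRY` macro are hypothetical names made up for this example, and mapped pinned memory is only assumed here as a stand-in for a host-resident output. The kernel writes its result into host memory asynchronously, so the host may only read that memory after `cudaStreamSynchronize(stream)` returns.

```cuda
#include <cuda_runtime.h>

#include <cstdio>
#include <vector>

// Hypothetical helper (not from wholememory): bail out of the demo on any CUDA error.
// For brevity this sketch does not free resources on the error path.
#define DEMO_CUDA_TRY(call)                                                  \
  do {                                                                       \
    cudaError_t err_ = (call);                                               \
    if (err_ != cudaSuccess) {                                               \
      std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_));    \
      return false;                                                          \
    }                                                                        \
  } while (0)

// Hypothetical stand-in for a scatter kernel: out[indices[i]] = in[i].
__global__ void toy_scatter_kernel(const float* in, const int* indices, float* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[indices[i]] = in[i]; }
}

// Scatters n values into mapped pinned host memory, synchronizes the stream,
// then verifies the result on the host. Returns true if every value arrived.
bool scatter_to_host_and_sync(int n)
{
  // Input values 0..n-1 and a reversing index permutation.
  std::vector<float> h_in(n);
  std::vector<int> h_idx(n);
  for (int i = 0; i < n; ++i) {
    h_in[i]  = static_cast<float>(i);
    h_idx[i] = n - 1 - i;
  }

  float* d_in = nullptr;
  int* d_idx  = nullptr;
  DEMO_CUDA_TRY(cudaMalloc(&d_in, n * sizeof(float)));
  DEMO_CUDA_TRY(cudaMalloc(&d_idx, n * sizeof(int)));
  DEMO_CUDA_TRY(cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice));
  DEMO_CUDA_TRY(cudaMemcpy(d_idx, h_idx.data(), n * sizeof(int), cudaMemcpyHostToDevice));

  // Output lives in host memory that the device can write to directly.
  float* h_out     = nullptr;
  float* d_out_ptr = nullptr;
  DEMO_CUDA_TRY(cudaHostAlloc(&h_out, n * sizeof(float), cudaHostAllocMapped));
  DEMO_CUDA_TRY(cudaHostGetDevicePointer(&d_out_ptr, h_out, 0));

  cudaStream_t stream;
  DEMO_CUDA_TRY(cudaStreamCreate(&stream));

  // The launch returns immediately; the writes to h_out happen asynchronously.
  const int threads = 256;
  const int blocks  = (n + threads - 1) / threads;
  toy_scatter_kernel<<<blocks, threads, 0, stream>>>(d_in, d_idx, d_out_ptr, n);
  DEMO_CUDA_TRY(cudaGetLastError());

  // Without this call, the host loop below could observe stale or partial data.
  DEMO_CUDA_TRY(cudaStreamSynchronize(stream));

  // in[i] == i was written to out[n-1-i], so out[j] must equal n-1-j.
  bool ok = true;
  for (int j = 0; j < n; ++j) {
    if (h_out[j] != static_cast<float>(n - 1 - j)) { ok = false; }
  }

  DEMO_CUDA_TRY(cudaStreamDestroy(stream));
  DEMO_CUDA_TRY(cudaFreeHost(h_out));
  DEMO_CUDA_TRY(cudaFree(d_idx));
  DEMO_CUDA_TRY(cudaFree(d_in));
  return ok;
}
```

Calling `scatter_to_host_and_sync(1024)` from a `main` function exercises the whole path. If the output buffer were device memory consumed only by later work on the same stream, stream ordering alone would be enough, which matches the reasoning given above for gather.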

Authors:
  - Chang Liu (https://github.com/chang-l)
  - Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:
  - https://github.com/linhu-nv

URL: #73
chang-l authored Dec 2, 2024
1 parent 4807986 commit 466b5b9
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions cpp/src/wholememory_ops/scatter_op_impl_mapped.cu
@@ -18,6 +18,7 @@
#include <wholememory/env_func_ptrs.h>
#include <wholememory/wholememory.h>

#include "cuda_macros.hpp"
#include "wholememory_ops/functions/gather_scatter_func.h"

namespace wholememory_ops {
@@ -41,6 +42,7 @@ wholememory_error_code_t wholememory_scatter_mapped(
wholememory_desc,
stream,
scatter_sms);
WM_CUDA_CHECK(cudaStreamSynchronize(stream));
}

} // namespace wholememory_ops
