[Bugfix] Add stream synchronization before the scatter operation (#73)

This addresses the issue from rapidsai/wholegraph#229. The change applies only to the last scatter operation before the Python interface, not to all internal `scatter_func` calls.

Because the output of the scatter operation may reside in host memory (e.g., when emb_device = 'cpu'), the synchronization must be performed internally. This ensures users do not need to explicitly synchronize the compute stream before accessing the host memory. The gather operation is different: its output is always in device memory, so host-side synchronization is unnecessary there.

Authors:
- Chang Liu (https://github.com/chang-l)
- Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:
- https://github.com/linhu-nv

URL: #73
rapidsai · Dec 2, 2024 · 466b5b9
1 parent 4807986
Showing 1 changed file with 2 additions and 0 deletions.
diff --git a/cpp/src/wholememory_ops/scatter_op_impl_mapped.cu b/cpp/src/wholememory_ops/scatter_op_impl_mapped.cu
@@ -18,6 +18,7 @@
 #include <wholememory/env_func_ptrs.h>
 #include <wholememory/wholememory.h>
 
+#include "cuda_macros.hpp"
 #include "wholememory_ops/functions/gather_scatter_func.h"
 
 namespace wholememory_ops {
@@ -41,6 +42,7 @@ wholememory_error_code_t wholememory_scatter_mapped(
                       wholememory_desc,
                       stream,
                       scatter_sms);
+  WM_CUDA_CHECK(cudaStreamSynchronize(stream));
 }
 
 }  // namespace wholememory_ops
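
The reasoning in the commit message (a scatter whose output lives in host memory must be followed by a stream synchronization before the host reads it) can be illustrated with a minimal standalone CUDA sketch. This is not WholeGraph code: the kernel `toy_scatter`, the macro `DEMO_CUDA_CHECK`, and the function `scatter_to_host_and_check` are hypothetical names invented for this illustration, and the sketch assumes a device with unified virtual addressing so that pinned host memory can be written directly by a kernel.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-check macro for this sketch (not WholeGraph's WM_CUDA_CHECK).
#define DEMO_CUDA_CHECK(call)                                                \
  do {                                                                       \
    cudaError_t err_ = (call);                                               \
    if (err_ != cudaSuccess) {                                               \
      std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_));    \
      std::exit(EXIT_FAILURE);                                               \
    }                                                                        \
  } while (0)

// Toy scatter kernel: out[indices[i]] = in[i].
__global__ void toy_scatter(const float* in, const int* indices, float* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[indices[i]] = in[i]; }
}

// Scatters four values into pinned, mapped host memory and returns true
// if the host sees the expected result after synchronizing the stream.
bool scatter_to_host_and_check()
{
  constexpr int n      = 4;
  const float h_in[n]  = {10.0f, 11.0f, 12.0f, 13.0f};
  const int h_idx[n]   = {3, 1, 0, 2};

  cudaStream_t stream;
  DEMO_CUDA_CHECK(cudaStreamCreate(&stream));

  float* d_in = nullptr;
  int* d_idx  = nullptr;
  DEMO_CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&d_in), n * sizeof(float)));
  DEMO_CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&d_idx), n * sizeof(int)));
  DEMO_CUDA_CHECK(
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, stream));
  DEMO_CUDA_CHECK(
    cudaMemcpyAsync(d_idx, h_idx, n * sizeof(int), cudaMemcpyHostToDevice, stream));

  // The output buffer lives in host memory that the kernel can write to directly.
  float* h_out = nullptr;
  DEMO_CUDA_CHECK(
    cudaHostAlloc(reinterpret_cast<void**>(&h_out), n * sizeof(float), cudaHostAllocMapped));
  float* d_out_alias = nullptr;
  DEMO_CUDA_CHECK(
    cudaHostGetDevicePointer(reinterpret_cast<void**>(&d_out_alias), h_out, 0));

  // The launch is asynchronous: control returns to the host immediately.
  toy_scatter<<<1, 32, 0, stream>>>(d_in, d_idx, d_out_alias, n);
  DEMO_CUDA_CHECK(cudaGetLastError());

  // Without this call, the host reads below could race with the kernel.
  DEMO_CUDA_CHECK(cudaStreamSynchronize(stream));

  const bool ok = h_out[3] == 10.0f && h_out[1] == 11.0f &&
                  h_out[0] == 12.0f && h_out[2] == 13.0f;

  DEMO_CUDA_CHECK(cudaFreeHost(h_out));
  DEMO_CUDA_CHECK(cudaFree(d_idx));
  DEMO_CUDA_CHECK(cudaFree(d_in));
  DEMO_CUDA_CHECK(cudaStreamDestroy(stream));
  return ok;
}
```

Calling `scatter_to_host_and_check()` from a `main` function and compiling with `nvcc` exercises the sketch. When the output is instead in device memory, as the commit message notes for gather, later work queued on the same stream is already ordered after the kernel, so no host-side wait is needed at this point.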