[BUG] cudf::to_arrow is not properly synchronized #6706

harrism · 2020-11-09T00:49:19Z

cudf::to_arrow copies data from device to host on a stream and never synchronizes that stream before making it available to the Arrow library, which does not use the same stream. Therefore I believe there is a potential race condition.

This is an example of one of the functions in to_arrow.cpp that copies data asynchronously to the host.

cudf/cpp/src/interop/to_arrow.cpp

Lines 39 to 58 in 3d44ed5

    
           template <typename T> 
        
           std::shared_ptr<arrow::Buffer> fetch_data_buffer(column_view input_view, 
        
                                                            arrow::MemoryPool* ar_mr, 
        
                                                            cudaStream_t stream) 
        
           { 
        
             const int64_t data_size_in_bytes = sizeof(T) * input_view.size(); 
        
             auto result = arrow::AllocateBuffer(data_size_in_bytes, ar_mr); 
        
             CUDF_EXPECTS(result.ok(), "Failed to allocate Arrow buffer for data"); 
        
             std::shared_ptr<arrow::Buffer> data_buffer = std::move(result.ValueOrDie()); 
        
             CUDA_TRY(cudaMemcpyAsync(data_buffer->mutable_data(), 
        
                                      input_view.data<T>(), 
        
                                      data_size_in_bytes, 
        
                                      cudaMemcpyDeviceToHost, 
        
                                      stream)); 
        
             return data_buffer; 
        
           }

The to_arrow function ultimately calls this function and others like it, but never synchronizes before returning. Therefore an application could use this function to get an Arrow array/buffer, and then access that array on the host without synchronizing the stream.

I believe currently pageable host memory is used, which would cause cudaMemcpyAsync to return only once the copy is completed. But if this is switched to pinned host memory, it will be asynchronous, and we will have a race condition.

The text was updated successfully, but these errors were encountered:

harrism · 2020-11-09T00:51:48Z

I believe the same is true of to_dlpack

Converting libcudf to use rmm::cuda_stream_view will require a LOT of changes, so I'm splitting it into multiple PRs to ease reviewing. This is the second PR in the series. This series of PRs will Replace usage of cudaStream_t with rmm::cuda_stream_view Replace usage of 0 or nullptr as a stream identifier with rmm::cuda_stream_default Ensure all APIs always order the stream parameter before the memory resource parameter. #5119 Contributes to #6645 and #5119 Depends on #6646 so this PR will look much bigger until that one is merged. Also fixes #6706 (to_arrow and to_dlpack are not properly synchronized). This second PR converts: table.hpp (and source / dependencies) column_factories.hpp (and source / dependencies) headers in include/detail/aggregation (and source / dependencies) include/detail/groupby/sort_helper.hpp (and source / dependencies) include/detail/utilities/cuda.cuh (and dependencies) binary ops concatenate copy_if copy_range fill gather get_value hash groupby quantiles reductions repeat replace reshape round scatter search sequence sorting stream compaction

harrism added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. labels Nov 9, 2020

harrism self-assigned this Nov 9, 2020

harrism mentioned this issue Nov 9, 2020

Replace raw streams with rmm::cuda_stream_view (part 2) #6648

Merged

harrism closed this as completed in #6648 Nov 20, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] cudf::to_arrow is not properly synchronized #6706

[BUG] cudf::to_arrow is not properly synchronized #6706

harrism commented Nov 9, 2020

harrism commented Nov 9, 2020

[BUG] cudf::to_arrow is not properly synchronized #6706

[BUG] cudf::to_arrow is not properly synchronized #6706

Comments

harrism commented Nov 9, 2020

harrism commented Nov 9, 2020