make data chunk reader return unique_ptr #9129

cwharris · 2021-08-26T20:41:12Z

Depends on rapidsai/rmm#851, for performance reasons.

There are two parts to this change. First, we remove a workaround for RMM's sync-and-steal behavior, which was preventing some work from overlapping. This behavior is significantly improved in rmm#851. The workaround involved allocating long-lived buffers and reusing them. With this change, we create device_uvectors on the fly and return them, which brings us to the second part of the change...

Because the data chunk reader owned the long-lived buffers, it was possible to return device_spans from the get_next_chunk method. Now that the device_uvectors are created on the fly and returned, we need an interface that supports ownership of the data on an implementation basis. Different readers can return different implementations of device_data_chunk via a unique_ptr. Those implementations can be owners of data, or just views.
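To illustrate the ownership idea, here is a minimal host-only sketch, not the code in this PR. It assumes `std::vector<char>` as a stand-in for `rmm::device_uvector<char>`, drops the CUDA stream argument, and uses hypothetical names (`data_chunk`, `owning_data_chunk`, `view_data_chunk`, `chunk_reader`, `copying_reader`, `view_reader`, `read_all`) that do not appear in libcudf. The point is only that the reader returns a `unique_ptr` to an abstract chunk, and each implementation decides whether that chunk owns its bytes or merely views them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side analogue of a device data chunk: callers only see bytes.
class data_chunk {
 public:
  virtual ~data_chunk() = default;
  virtual char const* data() const = 0;
  virtual std::size_t size() const = 0;
};

// Owning chunk: the buffer is freed when the chunk is destroyed.
// (std::vector is an assumption standing in for rmm::device_uvector<char>.)
class owning_data_chunk : public data_chunk {
 public:
  explicit owning_data_chunk(std::vector<char>&& buffer) : _buffer(std::move(buffer)) {}
  char const* data() const override { return _buffer.data(); }
  std::size_t size() const override { return _buffer.size(); }

 private:
  std::vector<char> _buffer;
};

// Non-owning chunk: just a view over memory that someone else keeps alive.
class view_data_chunk : public data_chunk {
 public:
  view_data_chunk(char const* data, std::size_t size) : _data(data), _size(size) {}
  char const* data() const override { return _data; }
  std::size_t size() const override { return _size; }

 private:
  char const* _data;
  std::size_t _size;
};

// Reader interface: the return type leaves the ownership decision to the implementation.
class chunk_reader {
 public:
  virtual ~chunk_reader() = default;
  virtual std::unique_ptr<data_chunk> get_next_chunk(std::size_t size) = 0;
};

// Allocates a fresh buffer per call and hands ownership to the caller.
class copying_reader : public chunk_reader {
 public:
  explicit copying_reader(std::string source) : _source(std::move(source)) {}
  std::unique_ptr<data_chunk> get_next_chunk(std::size_t size) override {
    std::size_t const n = std::min(size, _source.size() - _pos);
    std::vector<char> buffer(_source.data() + _pos, _source.data() + _pos + n);
    _pos += n;
    return std::make_unique<owning_data_chunk>(std::move(buffer));
  }

 private:
  std::string _source;
  std::size_t _pos = 0;
};

// Reads from memory that outlives the reader, so it can return views without copying.
class view_reader : public chunk_reader {
 public:
  view_reader(char const* data, std::size_t size) : _data(data), _size(size) {}
  std::unique_ptr<data_chunk> get_next_chunk(std::size_t size) override {
    std::size_t const n = std::min(size, _size - _pos);
    auto chunk = std::make_unique<view_data_chunk>(_data + _pos, n);
    _pos += n;
    return chunk;
  }

 private:
  char const* _data;
  std::size_t _size;
  std::size_t _pos = 0;
};

// Consumer code is identical for both readers; an empty chunk signals the end.
inline std::string read_all(chunk_reader& reader, std::size_t chunk_size) {
  assert(chunk_size > 0);  // a zero-sized request would look like end of input
  std::string out;
  while (true) {
    auto chunk = reader.get_next_chunk(chunk_size);
    if (chunk->size() == 0) { break; }
    out.append(chunk->data(), chunk->size());
  }
  return out;
}
```

In this sketch the consumer (`read_all`) never needs to know which kind of chunk it received. An owning chunk releases its buffer when the `unique_ptr` goes out of scope, while a view chunk releases nothing.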

This PR should merge only after rmm#851, else it will cause performance degradation in multibyte_split (which is the only API to use this reader so far).

vuule

The change is definitely good; this is what the reader should return. I'm just not sure why this wasn't the case initially.

vuule · 2021-08-26T21:06:56Z

cpp/include/cudf/io/text/data_chunk_source_factories.hpp

@@ -65,19 +89,8 @@ class istream_data_chunk_reader : public data_chunk_reader {
    }
  }

-  device_span<char> find_or_create_data(std::size_t size, rmm::cuda_stream_view stream)


Kind of hard to connect the RMM optimization and changes here. Was this a performance hack to avoid repeatedly allocating the chunks?

The sync-and-steal behavior in RMM was stealing too often, preventing portions of work from overlapping. With the changes in rmm#851, we no longer have to work around that behavior.

Updated the PR description to give a better overview of how the changes relate.

codecov · 2021-08-26T22:05:47Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@d29c441).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.10    #9129   +/-   ##
===============================================
  Coverage                ?   10.83%           
===============================================
  Files                   ?      114           
  Lines                   ?    19101           
  Branches                ?        0           
===============================================
  Hits                    ?     2070           
  Misses                  ?    17031           
  Partials                ?        0

cwharris · 2021-08-26T23:18:15Z

@vuule it wasn't the case initially because there was no practical use case for the new classes, since ownership of the data was never returned to the caller. Now that there is a practical use case, the design is more obviously appropriate.

cpp/include/cudf/io/text/data_chunk_source.hpp

cwharris · 2021-08-27T17:32:33Z

@gpucibot merge

make data chunk reader to return unique_ptr, allowing impl to determine ownership of data

6bb2c9e

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 26, 2021

cwharris requested review from vuule, elstehle and jrhemstad August 26, 2021 20:41

vuule reviewed Aug 26, 2021

vuule assigned cwharris Aug 26, 2021

vuule added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Aug 26, 2021

vuule approved these changes Aug 26, 2021

vuule added the cuIO cuIO issue label Aug 26, 2021

cwharris changed the title ~~make data chunk reader to return unique_ptr~~ make data chunk reader return unique_ptr Aug 26, 2021

add documentation for device_data_chunk

0503c2a

cwharris marked this pull request as ready for review August 27, 2021 00:10

cwharris requested a review from a team as a code owner August 27, 2021 00:10

elstehle approved these changes Aug 27, 2021

fix spelling of "guarantee"

486265c

rapids-bot bot merged commit 31b731e into rapidsai:branch-21.10 Aug 27, 2021
