Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make data chunk reader return unique_ptr #9129

Merged
merged 3 commits into from
Aug 27, 2021

Conversation

cwharris
Copy link
Contributor

@cwharris cwharris commented Aug 26, 2021

Depends on rapidsai/rmm#851, for performance reasons.

There are two parts to this change. First, we remove a workaround for RMM's sync-and-steal behavior which was preventing some work from overlapping. This behavior is significantly improveed in rmm#851. The workaround involved allocating long-lived buffers and reusing them. With this change, we create device_uvectors on-the-fly and return them, which brings us to the second part of the change...

Because the data chunk reader owned the long-lived buffers, it was possible to return device_spans from the get_next_chunk method. Now that the device_uvectors are created on the fly and returned, we need an interface that supports ownership of the data on an implementation basis. Different readers can return different implementations of device_data_chunk via a unique_ptr. Those implementations can be owners of data, or just views.

This PR should merge only after rmm#851, else it will cause performance degradation in multibyte_split (which is the only API to use this reader so far).

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 26, 2021
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is definitely good, this is what the reader should return. I'm just not sure why this wasn't the case initially.

@@ -65,19 +89,8 @@ class istream_data_chunk_reader : public data_chunk_reader {
}
}

device_span<char> find_or_create_data(std::size_t size, rmm::cuda_stream_view stream)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of hard to connect the RMM optimization and changes here. Was this a performance hack to avoid repeatedly allocating the chunks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the sync-and-steal behavior in rmm was stealing too often, preventing portions of work from overlapping. With the changes in rmm#851, we no longer have to work around that behavior.

Copy link
Contributor Author

@cwharris cwharris Aug 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the PR description to give a better overview of how the changes relate.

@codecov
Copy link

codecov bot commented Aug 26, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@d29c441). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.10    #9129   +/-   ##
===============================================
  Coverage                ?   10.83%           
===============================================
  Files                   ?      114           
  Lines                   ?    19101           
  Branches                ?        0           
===============================================
  Hits                    ?     2070           
  Misses                  ?    17031           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d29c441...486265c. Read the comment docs.

@vuule vuule added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Aug 26, 2021
@vuule vuule added the cuIO cuIO issue label Aug 26, 2021
@cwharris
Copy link
Contributor Author

@vuule it wasn't the case initially because there was no practical use case for the new classes, since ownership of the data was never returned to the caller. Now that there is a practical use case, the design is more obviously appropriate.

@cwharris cwharris changed the title make data chunk reader to return unique_ptr make data chunk reader return unique_ptr Aug 26, 2021
@cwharris cwharris marked this pull request as ready for review August 27, 2021 00:10
@cwharris cwharris requested a review from a team as a code owner August 27, 2021 00:10
cpp/include/cudf/io/text/data_chunk_source.hpp Outdated Show resolved Hide resolved
@cwharris
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 31b731e into rapidsai:branch-21.10 Aug 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants