Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add strip_delimiters option to read_text #11946

Merged

Conversation

upsj
Copy link
Contributor

@upsj upsj commented Oct 19, 2022

Description

This adds a strip_delimiters post-processing option to read_text. I needed to implement some lightweight striping because a thread-per-row parallelization of the string gather gave pretty bad performance.

For consistency, I also removed the special-case handling of delimiters at the end (previously adding an empty row), to match the read_csv behavior.

Benchmark results:

benchmarks/MULTIBYTE_SPLIT_NVBENCH --axis size_approx[pow2]=30 --axis byte_range_percent=100 --axis T=device --axis delim_size=4

[0] Tesla T4

T strip_delimiters delim_percent size_approx CPU Time Noise Peak Memory Usage Encoded file size
device 0 1 2^30 = 1073741824 178.133 ms 0.36% 3.709 GiB 1014.442 MiB
device 1 1 2^30 = 1073741824 188.328 ms 0.31% 4.690 GiB 1014.442 MiB
device 0 25 2^30 = 1073741824 206.188 ms 0.03% 5.292 GiB 953.075 MiB
device 1 25 2^30 = 1073741824 242.534 ms 0.50% 5.975 GiB 953.075 MiB

Closes #11625

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@upsj upsj added 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 19, 2022
@upsj upsj requested a review from a team as a code owner October 19, 2022 13:16
@upsj upsj self-assigned this Oct 19, 2022
@upsj upsj requested a review from a team as a code owner October 19, 2022 13:16
@davidwendt
Copy link
Contributor

... thread-per-row parallelization of the string gather gave pretty bad performance

Can you elaborate on this? I think there is a fast strings gather that may be possible to use here.

@codecov
Copy link

codecov bot commented Oct 19, 2022

Codecov Report

Base: 87.40% // Head: 88.15% // Increases project coverage by +0.74% 🎉

Coverage data is based on head (ca5568a) compared to base (f72c4ce).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #11946      +/-   ##
================================================
+ Coverage         87.40%   88.15%   +0.74%     
================================================
  Files               133      133              
  Lines             21833    21995     +162     
================================================
+ Hits              19084    19389     +305     
+ Misses             2749     2606     -143     
Impacted Files Coverage Δ
python/strings_udf/strings_udf/__init__.py 86.27% <0.00%> (-10.61%) ⬇️
python/cudf/cudf/io/text.py 91.66% <0.00%> (-8.34%) ⬇️
python/cudf/cudf/core/_base_index.py 82.20% <0.00%> (-3.35%) ⬇️
python/strings_udf/strings_udf/_typing.py 94.73% <0.00%> (-1.06%) ⬇️
python/cudf/cudf/testing/dataset_generator.py 72.83% <0.00%> (-0.42%) ⬇️
python/dask_cudf/dask_cudf/backends.py 84.90% <0.00%> (-0.37%) ⬇️
python/dask_cudf/dask_cudf/core.py 73.92% <0.00%> (-0.21%) ⬇️
python/cudf/cudf/io/orc.py 92.94% <0.00%> (-0.09%) ⬇️
python/cudf/cudf/__init__.py 90.69% <0.00%> (ø)
python/cudf/cudf/core/udf/_ops.py 100.00% <0.00%> (ø)
... and 23 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@upsj
Copy link
Contributor Author

upsj commented Oct 19, 2022

Can you elaborate on this? I think there is a fast strings gather that may be possible to use here.

When running for_each_n with one thread per row copying the elements excluding the delimiter, I got runtimes around 1s instead of 20 ms with the current solution. NSight Compute profiles showed horrible cache utilization numbers. cudf::strings::detail::gather looks almost right - I need to transform individual strings to strip off a fixed number of characters (except for the last row), not gather a subset of strings from a larger column. Maybe the word gather is a bit misleading. But the underlying kernels look pretty close to what I need.

@davidwendt
Copy link
Contributor

Can you elaborate on this? I think there is a fast strings gather that may be possible to use here.

When running for_each_n with one thread per row copying the elements excluding the delimiter, I got runtimes around 1s instead of 20 ms with the current solution. NSight Compute profiles showed horrible cache utilization numbers. cudf::strings::detail::gather looks almost right - I need to transform individual strings to strip off a fixed number of characters (except for the last row), not gather a subset of strings from a larger column. Maybe the word gather is a bit misleading. But the underlying kernels look pretty close to what I need.

Ok. I was thinking more along the lines of this make_strings_column function that can take a device-span of string_view (or equivalent thrust::pair): https://docs.rapids.ai/api/libcudf/stable/group__column__factories.html#ga993941cbf14270bcea2cc95427996de1

It would just be a matter of building a device-uvector of these to call one of these factory functions which has a highly tuned gather operation for building a strings column from individual strings in device memory.

For reference, both of these factory functions (string_view and thrust_pair) call into

template <typename IndexPairIterator>
std::unique_ptr<column> make_strings_column(IndexPairIterator begin,
IndexPairIterator end,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)

@upsj
Copy link
Contributor Author

upsj commented Oct 19, 2022

@davidwendt thanks for the details, that shaved another 18ms off the runtime for the long string case (at the cost of maybe 20 ms for the short string case, but I'll take the added simplicity :) )

@upsj
Copy link
Contributor Author

upsj commented Oct 24, 2022

rerun tests

rapids-bot bot pushed a commit that referenced this pull request Oct 24, 2022
Simplifies the `cudf::strings::strip` function to use the `cudf::make_strings_column` that accepts an iterator of pairs. This factory has a highly tuned gather implementation for building a strings column from an vector (iterator) of strings in device memory.
This was inspired by the review and work in #11946. This also gives a small improvement in the performance of small columns of large strings and even more improvement in large columns of large-ish strings for strip.
No function has changed just the internal implementation has been simplified.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Tobias Ribizel (https://github.com/upsj)

URL: #11954
@upsj upsj force-pushed the feature/multibyte_split_delimiter_erase branch from eb1be96 to d99bfe5 Compare October 26, 2022 10:00
@upsj upsj added cuIO cuIO issue 4 - Needs cuIO Reviewer and removed 3 - Ready for Review Ready for review by team labels Oct 26, 2022
python/cudf/cudf/utils/ioutils.py Outdated Show resolved Hide resolved
@upsj upsj added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs cuIO Reviewer labels Oct 27, 2022
@upsj
Copy link
Contributor Author

upsj commented Oct 27, 2022

rerun tests

@upsj
Copy link
Contributor Author

upsj commented Oct 27, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b4ca894 into rapidsai:branch-22.12 Oct 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] read_text should support removing delimiters from the output
4 participants