Add `strip_delimiters` option to `read_text` #11946

upsj · 2022-10-19T13:16:15Z

Description

This adds a strip_delimiters post-processing option to read_text. I needed to implement some lightweight stripping because a thread-per-row parallelization of the string gather gave pretty bad performance.

For consistency, I also removed the special-case handling of delimiters at the end (previously adding an empty row), to match the read_csv behavior.
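
The row semantics described above can be modeled in a few lines of plain Python. This is only a sketch of the behavior as stated in this description, not the libcudf implementation: `split_rows` is a hypothetical helper, and it assumes that a delimiter at the very end of the input does not produce an extra empty row. Later commits in this PR revisit the trailing-delimiter handling, so the merged behavior may differ in that edge case.

```python
def split_rows(text, delimiter, strip_delimiters=False):
    """Hypothetical pure-Python model of delimiter-based row splitting.

    Assumption: a delimiter at the very end of the input does not
    create an additional empty row.
    """
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    rows = []
    start = 0
    while start < len(text):
        pos = text.find(delimiter, start)
        if pos == -1:
            # Last row: no delimiter follows, so there is nothing to strip.
            rows.append(text[start:])
            break
        end = pos + len(delimiter)
        # Keep the delimiter by default, drop it when stripping is requested.
        rows.append(text[start:pos] if strip_delimiters else text[start:end])
        start = end
    return rows


print(split_rows("a::b::c", "::"))                         # ['a::', 'b::', 'c']
print(split_rows("a::b::c", "::", strip_delimiters=True))  # ['a', 'b', 'c']
print(split_rows("a::b::", "::", strip_delimiters=True))   # ['a', 'b']
```

Without stripping, every row except possibly the last keeps its trailing delimiter, which is why the stripping step only has to remove a fixed number of bytes per row.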

Benchmark results:

benchmarks/MULTIBYTE_SPLIT_NVBENCH --axis size_approx[pow2]=30 --axis byte_range_percent=100 --axis T=device --axis delim_size=4

[0] Tesla T4

T	strip_delimiters	delim_percent	size_approx	CPU Time	Noise	Peak Memory Usage	Encoded file size
device	0	1	2^30 = 1073741824	178.133 ms	0.36%	3.709 GiB	1014.442 MiB
device	1	1	2^30 = 1073741824	188.328 ms	0.31%	4.690 GiB	1014.442 MiB
device	0	25	2^30 = 1073741824	206.188 ms	0.03%	5.292 GiB	953.075 MiB
device	1	25	2^30 = 1073741824	242.534 ms	0.50%	5.975 GiB	953.075 MiB
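
To put these numbers in proportion, the relative cost of enabling strip_delimiters can be computed directly from the four rows above. The snippet below is plain arithmetic on the reported values; the dictionary layout and the `overhead` helper are illustrative names only.

```python
# (delim_percent, strip_delimiters) -> (time in ms, peak memory in GiB),
# copied from the benchmark table above.
results = {
    (1, 0): (178.133, 3.709),
    (1, 1): (188.328, 4.690),
    (25, 0): (206.188, 5.292),
    (25, 1): (242.534, 5.975),
}


def overhead(delim_percent):
    """Return the (time, peak memory) increase in percent with stripping enabled."""
    base_time, base_mem = results[(delim_percent, 0)]
    strip_time, strip_mem = results[(delim_percent, 1)]
    return (100 * (strip_time / base_time - 1), 100 * (strip_mem / base_mem - 1))


for pct in (1, 25):
    time_pct, mem_pct = overhead(pct)
    print(f"delim_percent={pct}: time +{time_pct:.1f}%, peak memory +{mem_pct:.1f}%")
```

On these measurements, stripping costs roughly 6% more time at 1% delimiters and roughly 18% more at 25% delimiters.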

Closes #11625

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

davidwendt · 2022-10-19T14:57:09Z

... thread-per-row parallelization of the string gather gave pretty bad performance

Can you elaborate on this? I think there is a fast strings gather that may be possible to use here.

codecov · 2022-10-19T15:36:42Z

Codecov Report

Base: 87.40% // Head: 88.15% // Increases project coverage by +0.74% 🎉

Coverage data is based on head (ca5568a) compared to base (f72c4ce).
Patch has no changes to coverable lines.

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-22.12   #11946      +/-   ##
================================================
+ Coverage         87.40%   88.15%   +0.74%     
================================================
  Files               133      133              
  Lines             21833    21995     +162     
================================================
+ Hits              19084    19389     +305     
+ Misses             2749     2606     -143

Impacted Files	Coverage Δ
python/strings_udf/strings_udf/__init__.py	`86.27% <0.00%> (-10.61%)`	⬇️
python/cudf/cudf/io/text.py	`91.66% <0.00%> (-8.34%)`	⬇️
python/cudf/cudf/core/_base_index.py	`82.20% <0.00%> (-3.35%)`	⬇️
python/strings_udf/strings_udf/_typing.py	`94.73% <0.00%> (-1.06%)`	⬇️
python/cudf/cudf/testing/dataset_generator.py	`72.83% <0.00%> (-0.42%)`	⬇️
python/dask_cudf/dask_cudf/backends.py	`84.90% <0.00%> (-0.37%)`	⬇️
python/dask_cudf/dask_cudf/core.py	`73.92% <0.00%> (-0.21%)`	⬇️
python/cudf/cudf/io/orc.py	`92.94% <0.00%> (-0.09%)`	⬇️
python/cudf/cudf/__init__.py	`90.69% <0.00%> (ø)`
python/cudf/cudf/core/udf/_ops.py	`100.00% <0.00%> (ø)`
... and 23 more


upsj · 2022-10-19T16:45:26Z

Can you elaborate on this? I think there is a fast strings gather that may be possible to use here.

When running for_each_n with one thread per row copying the elements excluding the delimiter, I got runtimes around 1 s instead of 20 ms with the current solution. Nsight Compute profiles showed horrible cache utilization numbers. cudf::strings::detail::gather looks almost right, but I need to transform individual strings to strip off a fixed number of characters (except for the last row), not gather a subset of strings from a larger column. Maybe the word gather is a bit misleading. But the underlying kernels look pretty close to what I need.
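
The transformation described here can be sketched on the CPU with an offsets array. This is a hedged illustration only: `stripped_spans` and its parameters are hypothetical names, and the sketch assumes the usual strings-column layout of one character buffer plus `num_rows + 1` offsets, where every row except possibly the last ends with the delimiter.

```python
def stripped_spans(offsets, delimiter_size, last_row_has_delimiter=False):
    """Return one (start, size) span per row with the trailing delimiter removed.

    offsets has num_rows + 1 entries; row i covers offsets[i]:offsets[i + 1].
    """
    num_rows = len(offsets) - 1
    spans = []
    for row in range(num_rows):
        start, end = offsets[row], offsets[row + 1]
        is_last = row == num_rows - 1
        # Only the last row may lack a trailing delimiter.
        strip = delimiter_size if (not is_last or last_row_has_delimiter) else 0
        spans.append((start, end - start - strip))
    return spans


chars = b"a::bc::def"
offsets = [0, 3, 7, 10]
spans = stripped_spans(offsets, delimiter_size=2)
print(spans)                                # [(0, 1), (3, 2), (7, 3)]
print([chars[s:s + n] for s, n in spans])   # [b'a', b'bc', b'def']
```

Computing the spans is cheap and independent per row; the expensive part is copying the bytes, which is where a one-thread-per-row copy performed poorly in the measurements quoted above.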

davidwendt · 2022-10-19T17:06:53Z

Ok. I was thinking more along the lines of this make_strings_column function that can take a device-span of string_view (or equivalent thrust::pair): https://docs.rapids.ai/api/libcudf/stable/group__column__factories.html#ga993941cbf14270bcea2cc95427996de1

It would just be a matter of building a device-uvector of these to call one of these factory functions which has a highly tuned gather operation for building a strings column from individual strings in device memory.

For reference, both of these factory functions (string_view and thrust_pair) call into

cudf/cpp/include/cudf/strings/detail/strings_column_factories.cuh

Lines 72 to 76 in 6ca2ceb

    
           template <typename IndexPairIterator> 
        
           std::unique_ptr<column> make_strings_column(IndexPairIterator begin, 
        
                                                       IndexPairIterator end, 
        
                                                       rmm::cuda_stream_view stream, 
        
                                                       rmm::mr::device_memory_resource* mr)
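
As a rough CPU-only model of what such a pair-based factory does, the sketch below builds a contiguous character buffer and an offsets array from a list of (start, size) pairs. It is an assumption-laden illustration, not the libcudf code: `make_strings_column_model` is a hypothetical name, the pairs index into a Python bytes object instead of holding device pointers, and the real factory performs the gather in parallel on the GPU.

```python
from itertools import accumulate


def make_strings_column_model(chars, pairs):
    """Gather (start, size) spans of chars into (offsets, contiguous bytes)."""
    sizes = [size for _, size in pairs]
    # Exclusive scan over the sizes gives each row's output offset.
    offsets = [0, *accumulate(sizes)]
    out = bytearray(offsets[-1])
    # zip stops at the shorter input, so the final offset is not used as a destination.
    for (start, size), dst in zip(pairs, offsets):
        out[dst:dst + size] = chars[start:start + size]
    return offsets, bytes(out)


source = b"a::bc::def"
pairs = [(0, 1), (3, 2), (7, 3)]  # spans with the 2-byte delimiter already excluded
print(make_strings_column_model(source, pairs))  # ([0, 1, 3, 6], b'abcdef')
```

The appeal of this shape is that delimiter stripping reduces to adjusting each pair's size, while the byte copying is left to one shared, tuned gather.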

upsj · 2022-10-19T18:25:31Z

@davidwendt thanks for the details, that shaved another 18 ms off the runtime for the long string case (at the cost of maybe 20 ms for the short string case, but I'll take the added simplicity :) )

upsj · 2022-10-24T09:40:25Z

rerun tests

Simplifies the `cudf::strings::strip` function to use the `cudf::make_strings_column` factory that accepts an iterator of pairs. This factory has a highly tuned gather implementation for building a strings column from a vector (iterator) of strings in device memory. This was inspired by the review and work in #11946. It also gives a small performance improvement for small columns of large strings and a larger improvement for large columns of large-ish strings in strip. No function has changed; only the internal implementation has been simplified.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Tobias Ribizel (https://github.com/upsj)

URL: #11954

upsj · 2022-10-27T15:45:58Z

rerun tests

upsj · 2022-10-27T17:27:09Z

@gpucibot merge

upsj added the labels 3 - Ready for Review, libcudf, Python, improvement, and non-breaking on Oct 19, 2022

upsj requested a review from a team as a code owner October 19, 2022 13:16

upsj self-assigned this Oct 19, 2022

upsj requested review from bdice, galipremsagar and nvdbaranec October 19, 2022 13:16

davidwendt mentioned this pull request Oct 20, 2022

Use gather-based strings factory in cudf::strings::strip #11954 (merged)

galipremsagar reviewed Oct 24, 2022

python/cudf/cudf/io/text.py (resolved)

upsj added 7 commits October 25, 2022 15:46

add strip_delimiters option to read_text

0b345d4

use gathering make_string_column

3e4d464

restore original handling of delimiter in last row

84a975a

fix byte range matching last delimiter at end

7c30283

strip empty row with delimiter at end

d7c3afc

fix OOB access when no delimiter was found

8ae5707

fix python test

d99bfe5

upsj force-pushed the feature/multibyte_split_delimiter_erase branch from eb1be96 to d99bfe5 Compare October 26, 2022 10:00

ad missing docstring

01ba6f1

davidwendt reviewed Oct 26, 2022

cpp/include/cudf/io/text/multibyte_split.hpp (outdated, resolved)

davidwendt approved these changes Oct 26, 2022

upsj added the labels cuIO and 4 - Needs cuIO Reviewer and removed the label 3 - Ready for Review on Oct 26, 2022

galipremsagar approved these changes Oct 26, 2022

Reword parse_options docstring

6c2a3b0

Co-authored-by: Bradley Dice <[email protected]>

bdice requested changes Oct 26, 2022

python/cudf/cudf/utils/ioutils.py (outdated, resolved)

bdice approved these changes Oct 27, 2022

Extend docstring

ca5568a

upsj added the label 5 - Ready to Merge and removed the label 4 - Needs cuIO Reviewer on Oct 27, 2022

rapids-bot bot merged commit b4ca894 into rapidsai:branch-22.12 Oct 27, 2022
