Add strip_delimiters option to read_text #11946
Can you elaborate on this? I think there is a fast strings gather that may be possible to use here. |
Codecov Report
Base: 87.40% // Head: 88.15% // Increases project coverage by +0.74%

Additional details and impacted files

Coverage Diff:

| | branch-22.12 | #11946 | +/- |
|---|---|---|---|
| Coverage | 87.40% | 88.15% | +0.74% |
| Files | 133 | 133 | |
| Lines | 21833 | 21995 | +162 |
| Hits | 19084 | 19389 | +305 |
| Misses | 2749 | 2606 | -143 |
When running |
Ok. I was thinking more along the lines of this. It would just be a matter of building a device-uvector of these to call one of these factory functions, which has a highly tuned gather operation for building a strings column from individual strings in device memory. For reference, both of these factory functions ( cudf/cpp/include/cudf/strings/detail/strings_column_factories.cuh Lines 72 to 76 in 6ca2ceb
@davidwendt thanks for the details, that shaved another 18ms off the runtime for the long string case (at the cost of maybe 20 ms for the short string case, but I'll take the added simplicity :) ) |
rerun tests |
Simplifies the `cudf::strings::strip` function to use the `cudf::make_strings_column` overload that accepts an iterator of pairs. This factory has a highly tuned gather implementation for building a strings column from a vector (iterator) of strings in device memory. This was inspired by the review and work in #11946.

This also gives a small performance improvement for small columns of large strings, and a larger improvement for large columns of large-ish strings, for strip. No function has changed; only the internal implementation has been simplified.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Tobias Ribizel (https://github.com/upsj)

URL: #11954
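For readers unfamiliar with the "iterator of pairs" idea, here is a rough host-side sketch in plain Python. It is not the libcudf implementation and does not call any cudf API: `strip_pairs` and `gather_strings` are hypothetical names, and the column is modeled as a `bytes` buffer plus an offsets list. The point is only that strip can be expressed as "compute one (start, size) pair per row, then gather".

```python
from itertools import accumulate
from typing import List, Tuple


def strip_pairs(chars: bytes, offsets: List[int],
                to_strip: bytes = b" ") -> List[Tuple[int, int]]:
    """Return one (start, size) pair per row, trimmed on both sides.

    Hypothetical helper: models the per-row work that would run on the GPU.
    """
    pairs = []
    for begin, end in zip(offsets, offsets[1:]):
        # Skip leading bytes that should be stripped.
        while begin < end and chars[begin] in to_strip:
            begin += 1
        # Skip trailing bytes that should be stripped.
        while end > begin and chars[end - 1] in to_strip:
            end -= 1
        pairs.append((begin, end - begin))
    return pairs


def gather_strings(chars: bytes,
                   pairs: List[Tuple[int, int]]) -> Tuple[bytes, List[int]]:
    """Build a new (chars, offsets) column from (start, size) pairs.

    Hypothetical stand-in for a pair-based strings factory: offsets come
    from a prefix sum over sizes, chars from copying each slice.
    """
    new_offsets = [0] + list(accumulate(size for _, size in pairs))
    new_chars = b"".join(chars[start:start + size] for start, size in pairs)
    return new_chars, new_offsets


# Three rows: "  hi ", "x", "   "
chars = b"  hi x   "
offsets = [0, 5, 6, 9]
pairs = strip_pairs(chars, offsets)
new_chars, new_offsets = gather_strings(chars, pairs)
```

Separating the pair computation from the gather is what lets the copy step be tuned once and reused; in this toy model the gather is just a slice-and-join.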
eb1be96 to d99bfe5
Co-authored-by: Bradley Dice <[email protected]>
rerun tests |
@gpucibot merge |
Description
This adds a `strip_delimiters` post-processing option to `read_text`. I needed to implement some lightweight striping because a thread-per-row parallelization of the string gather gave pretty bad performance.

For consistency, I also removed the special-case handling of delimiters at the end (previously adding an empty row), to match the `read_csv` behavior.
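As a rough illustration of the intended row semantics, here is a pure-Python model. It is a sketch under assumptions, not the cudf implementation: `split_rows` is a hypothetical helper, and its handling of empty input (no rows) is an assumption of this model rather than something stated in the PR.

```python
from typing import List


def split_rows(text: str, delimiter: str,
               strip_delimiters: bool = False) -> List[str]:
    """Model of delimiter-based row splitting (hypothetical helper).

    By default each row keeps its trailing delimiter; with
    strip_delimiters=True the delimiter is dropped. A delimiter at the
    very end of the input does not produce an extra empty row.
    """
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    rows = []
    start = 0
    while start < len(text):
        end = text.find(delimiter, start)
        if end == -1:
            # Last row has no delimiter after it.
            rows.append(text[start:])
            break
        stop = end if strip_delimiters else end + len(delimiter)
        rows.append(text[start:stop])
        start = end + len(delimiter)
    return rows


kept = split_rows("a::b::c::", "::")
stripped = split_rows("a::b::c::", "::", strip_delimiters=True)
unterminated = split_rows("a::b", "::")
```

In this model the trailing `::` terminates the row `c` and nothing follows it, which mirrors the removal of the extra empty row described above.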
Benchmark results:
[0] Tesla T4
Closes #11625