[REVIEW] Change strings::split_record to return a lists column #5687
Please update the changelog in order to start CI tests. View the gpuCI docs here.
Codecov Report
@@            Coverage Diff             @@
##           branch-0.15    #5687   +/-   ##
============================================
  Coverage        86.38%   86.38%
============================================
  Files               76       76
  Lines            13041    13041
============================================
  Hits             11265    11265
  Misses            1776     1776
Does this close #5667, or just "reference" it?
This PR includes only the libcudf C++ code for this API and not any of the Python/Cython bindings to it.
Just one nit that I saw. Excited to start using this when it is merged in.
Looks good.
Looks good to me too.
Reference #5667
The `cudf::strings::contiguous_split_record` API returns a single memory buffer plus a vector of `column_view`s, similar to the `cudf::contiguous_split` function. Now that we have LIST type columns, the strings API has been repurposed in this PR to return a lists column instead. The lists column child will be a flat strings column and the list offsets identify each input string's tokens.
Since the new API no longer matches the `contiguous_split` result, this PR also changes the names to simply `cudf::strings::split_record` and `cudf::strings::rsplit_record`. All appropriate gtests have been updated accordingly.
The code change was involved because the original layout interleaved characters and offsets in the output memory. The new result creates one large strings column of all the tokens along with the appropriate offsets for the lists column.
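To make the new result layout concrete, here is a minimal host-side sketch of the idea: one flat collection of tokens plus per-row offsets. It is not the libcudf implementation and does not call libcudf. The names `ListsOfStrings`, `split_record_sketch`, and `tokens_of_row` are hypothetical, and the handling of empty strings here is an assumption that may differ from the real API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical host-side stand-in for a lists column of strings
// (illustration only, not a libcudf type).
struct ListsOfStrings {
  std::vector<std::string> child;    // flat "strings column" holding every token
  std::vector<std::size_t> offsets;  // row i owns child[offsets[i]] .. child[offsets[i + 1] - 1]
};

// Split each input string on a single-character delimiter and pack all
// tokens into one flat child with one offsets entry per row boundary.
// Assumption: an empty input string yields a single empty token.
ListsOfStrings split_record_sketch(const std::vector<std::string>& input, char delimiter)
{
  ListsOfStrings result;
  result.offsets.push_back(0);
  for (const std::string& row : input) {
    std::size_t begin = 0;
    while (true) {
      const std::size_t end = row.find(delimiter, begin);
      if (end == std::string::npos) {
        result.child.push_back(row.substr(begin));  // last token of this row
        break;
      }
      result.child.push_back(row.substr(begin, end - begin));
      begin = end + 1;  // skip past the delimiter
    }
    result.offsets.push_back(result.child.size());  // close this row's list
  }
  return result;
}

// Return the tokens that belong to one input row by slicing the flat child.
std::vector<std::string> tokens_of_row(const ListsOfStrings& lists, std::size_t row)
{
  assert(row + 1 < lists.offsets.size());
  return std::vector<std::string>(lists.child.begin() + lists.offsets[row],
                                  lists.child.begin() + lists.offsets[row + 1]);
}
```

For the input `{"a_b_c", "d_e", "f"}` with delimiter `'_'`, the sketch produces the child `a, b, c, d, e, f` and the offsets `0, 3, 5, 6`, so each row's tokens are a contiguous slice of the child.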
This API is not currently exposed in the cudf Python interface but could be used in the `str.split()` functions with `expand=False`.
This PR does not add regex support to any of the existing `split()` APIs.