Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Change strings::split_record to return a lists column #5687

Merged
merged 13 commits into from
Jul 21, 2020

Conversation

davidwendt
Copy link
Contributor

Reference #5667

The cudf::strings::contiguous_split_record API returns a single memory buffer plus a vector of column_views similar to the cudf::contiguous_split function. Now that we have LIST type columns, the strings API has been repurposed in this PR to return a lists column instead. The lists column child will be a flat strings column and the list offsets identify each input string's tokens.

Since the new APIs no longer matches the contiguous_split result, this PR also changes the names to simply cudf::strings::split_record and cudf::strings::rsplit_record. All appropriate gtests have been updated accordingly.

The code change was involved because the original layout of characters and offsets were interleaved in the output memory. The new result creates one large strings column of all the tokens along with the appropriate offsets for the lists column.

This API is not currently exposed in the cudf Python interface but could be used in the str.split() functions with expand=False

This PR does not include adding regex support to any of the existing split() APIs.

@davidwendt davidwendt self-assigned this Jul 14, 2020
@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Jul 14, 2020
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

1 similar comment
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@codecov
Copy link

codecov bot commented Jul 14, 2020

Codecov Report

Merging #5687 into branch-0.15 will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##           branch-0.15    #5687   +/-   ##
============================================
  Coverage        86.38%   86.38%           
============================================
  Files               76       76           
  Lines            13041    13041           
============================================
  Hits             11265    11265           
  Misses            1776     1776           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce826c5...ca7212d. Read the comment docs.

@davidwendt davidwendt marked this pull request as ready for review July 14, 2020 20:36
@davidwendt davidwendt requested a review from a team as a code owner July 14, 2020 20:36
@davidwendt davidwendt changed the title [WIP] Change strings::split_record to return a List column [REVIEW] Change strings::split_record to return a lists column Jul 14, 2020
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 14, 2020
@harrism
Copy link
Member

harrism commented Jul 19, 2020

Does this close #5667, or just "reference" it?

@davidwendt
Copy link
Contributor Author

Does this close #5667, or just "reference" it?

This PR includes only the libcudf C++ code for this API and not any of the Python/Cython bindings to it.
The Lists support for Python cudf is ongoing so the reference issue will likely be closed in a future PR.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one nit that I saw. Excited to start using this when it is merged in.

cpp/tests/strings/split_tests.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@revans2
Copy link
Contributor

revans2 commented Jul 20, 2020

Looks good to me too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants