Add format API for list column of strings #9454

Merged
rapids-bot merged 29 commits into rapidsai:branch-21.12 from fea-strings-format-lists
Nov 9, 2021

Conversation

davidwendt
Contributor

@davidwendt davidwendt commented Oct 15, 2021

Closes #8351

This PR adds the cudf::strings::format_list_column API, which creates the formatted output described in #8351. The API only accepts lists columns whose leaf elements are strings.

Example 1
  l1 = { [[a,b,c], [d,e]], [[f,g], [h]] }
  s1 = format_list_column(l1)
  s1 is now ["[[a,b,c],[d,e]]", "[[f,g],[h]]"]

Example 2
  l2 = { [[a,b,c], [d,e]], [NULL], [[f,g], NULL, [h]] }
  s2 = format_list_column(l2, '-', [':', '{', '}'])
  s2 is now ["{{a:b:c}:{d:e}}", "{-}", "{{f:g}:-:{h}}"]

The format API takes optional parameters that specify the strings to use for the opening bracket '[', the closing bracket ']', and the element separator ',', as well as the string used to represent null entries.
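The formatting rule in the two examples can be sketched on the host in plain C++. This is only a reference model of the output format, not the libcudf implementation: `Node`, `make_leaf`, `make_list`, `null_node`, `format_node`, and `format_rows` are hypothetical names, the separator order {element separator, opening enclosure, closing enclosure} is taken from Example 2, and rendering a null at any depth as the null replacement string is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side model of one nested list-of-strings element:
// a node is a null, a string leaf, or a list of child nodes.
struct Node {
  enum class Kind { Null, Leaf, List };
  Kind kind = Kind::Null;
  std::string value;           // used when kind == Leaf
  std::vector<Node> children;  // used when kind == List
};

Node null_node() { return Node{}; }

Node make_leaf(std::string s)
{
  Node n;
  n.kind  = Node::Kind::Leaf;
  n.value = std::move(s);
  return n;
}

Node make_list(std::vector<Node> children)
{
  Node n;
  n.kind     = Node::Kind::List;
  n.children = std::move(children);
  return n;
}

// separators = {element separator, opening enclosure, closing enclosure},
// the same order as [':', '{', '}'] in Example 2.
std::string format_node(Node const& n,
                        std::string const& na_rep,
                        std::vector<std::string> const& separators)
{
  assert(separators.size() == 3);
  if (n.kind == Node::Kind::Null) { return na_rep; }  // assumption: nulls at any depth
  if (n.kind == Node::Kind::Leaf) { return n.value; }
  std::string out = separators[1];
  for (std::size_t i = 0; i < n.children.size(); ++i) {
    if (i > 0) { out += separators[0]; }
    out += format_node(n.children[i], na_rep, separators);
  }
  out += separators[2];
  return out;
}

// Formats each row independently, producing one string per row.
std::vector<std::string> format_rows(
  std::vector<Node> const& rows,
  std::string const& na_rep,
  std::vector<std::string> const& separators = {",", "[", "]"})
{
  std::vector<std::string> out;
  out.reserve(rows.size());
  for (auto const& row : rows) { out.push_back(format_node(row, na_rep, separators)); }
  return out;
}
```

With rows built to match Example 2, calling `format_rows(l2, "-", {":", "{", "}"})` in this model yields the three strings listed there. The model makes `na_rep` a required argument so that it does not guess at the real API's default.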

@davidwendt davidwendt added labels Oct 15, 2021: feature request (New feature or request); 2 - In Progress (Currently a work in progress); libcudf (Affects libcudf C++/CUDA code); strings (strings issues, C++ and Python); non-breaking (Non-breaking change)
@davidwendt davidwendt self-assigned this Oct 15, 2021
@github-actions github-actions bot added the CMake (CMake build issue) label Oct 15, 2021
@codecov
codecov bot commented Oct 15, 2021

Codecov Report

Merging #9454 (bfb4118) into branch-21.12 (ab4bfaa) will decrease coverage by 0.15%.
The diff coverage is n/a.

❗ Current head bfb4118 differs from pull request most recent head e9dbe8d. Consider uploading reports for the commit e9dbe8d to get more accurate results

@@               Coverage Diff                @@
##           branch-21.12    #9454      +/-   ##
================================================
- Coverage         10.79%   10.63%   -0.16%     
================================================
  Files               116      117       +1     
  Lines             18869    19356     +487     
================================================
+ Hits               2036     2059      +23     
- Misses            16833    17297     +464     
Impacted Files                           Coverage Δ
python/dask_cudf/dask_cudf/sorting.py    92.90% <0.00%> (-1.21%) ⬇️
python/cudf/cudf/io/csv.py               0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py               0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py             0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py           0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py           0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py           0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py          0.00% <0.00%> (ø)
python/cudf/cudf/core/reshape.py         0.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py         0.00% <0.00%> (ø)
... and 42 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f041a47...e9dbe8d. Read the comment docs.

@github-actions github-actions bot added the Python (Affects Python cuDF API) label Oct 15, 2021
@github-actions github-actions bot added the conda label Oct 19, 2021
@davidwendt davidwendt marked this pull request as ready for review October 20, 2021 19:17
@davidwendt davidwendt requested review from a team as code owners October 20, 2021 19:17
Contributor

@robertmaynard robertmaynard left a comment

CMake changes LGTM

Member

@ajschmidt8 ajschmidt8 left a comment

Approving ops-codeowner file changes

Comment on lines +60 to +61
  strings_column_view const& separators = strings_column_view(column_view{
    data_type{type_id::STRING}, 0, nullptr}),
Contributor

Is there a reason to prefer a column rather than three separate scalars? Admittedly that adds more parameters to the API, but it seems awkward to stuff all three in a column (especially since I would anticipate that overloading the element separator would be a largely independent request from overriding the enclosures).

Contributor Author

It is a little more efficient to create a column of strings, which normally is a single device copy, rather than individual scalars, which would be 3 small device copies.

Contributor

That's true, but it feels like a (minor) abuse of a column to stuff in three values that are semantically different but happen to be of the same type. The elements of a column seem like they should all "mean" the same thing, if that makes sense. This feels like a premature optimization, but H2D copies are expensive so maybe the improvement is worth it. I trust your judgment there, just felt odd to me.
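For illustration only, the trade-off in this thread can be sketched on the host: a hypothetical named wrapper keeps the three values semantically distinct at the call site, and a packing step lays them out as one characters buffer plus offsets, similar in spirit to how a strings column stores its data. `ListSeparators`, `PackedStrings`, `pack`, and `unpack` are made-up names, not libcudf types, and nothing here is what the PR actually does.

```cpp
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <string>
#include <vector>

// Hypothetical named wrapper: each field has its own meaning.
struct ListSeparators {
  std::string element = ",";
  std::string open    = "[";
  std::string close   = "]";
};

// Offsets-plus-characters layout: all strings stored back to back,
// with offsets[i]..offsets[i + 1] delimiting string i.
struct PackedStrings {
  std::vector<std::int32_t> offsets;  // size = number of strings + 1
  std::string chars;                  // concatenated characters
};

// Packs the three separators in the order {element, open, close}.
PackedStrings pack(ListSeparators const& s)
{
  PackedStrings p;
  p.offsets.push_back(0);
  for (std::string const* str : {&s.element, &s.open, &s.close}) {
    p.chars += *str;
    p.offsets.push_back(static_cast<std::int32_t>(p.chars.size()));
  }
  return p;
}

// Recovers string i from the packed layout.
std::string unpack(PackedStrings const& p, std::size_t i)
{
  auto const begin = static_cast<std::size_t>(p.offsets[i]);
  auto const end   = static_cast<std::size_t>(p.offsets[i + 1]);
  return p.chars.substr(begin, end - begin);
}
```

In this sketch the packed form holds all three strings in contiguous storage that could be transferred together, while the wrapper preserves the distinct names the reviewer is asking about. Whether that is worth an extra type in the public API is the judgment call discussed above.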

@davidwendt davidwendt requested a review from vyasr October 29, 2021 11:55
Contributor

@vyasr vyasr left a comment

I left a few open questions, but not much clearly actionable and mostly just to satisfy my curiosity. I think we can ship it.

Contributor

@karthikeyann karthikeyann left a comment

LGTM 👍


TEST_F(StringsFormatListsTest, EmptyList)
{
using STR_LISTS = cudf::test::lists_column_wrapper<cudf::string_view>;
Contributor

This repeated line could be moved to the top.

@davidwendt
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 499ebae into rapidsai:branch-21.12 Nov 9, 2021
@davidwendt davidwendt deleted the fea-strings-format-lists branch November 10, 2021 13:33
Labels
3 - Ready for Review (Ready for review by team); CMake (CMake build issue); feature request (New feature or request); libcudf (Affects libcudf C++/CUDA code); non-breaking (Non-breaking change); Python (Affects Python cuDF API); strings (strings issues, C++ and Python)
Development

Successfully merging this pull request may close these issues.

[FEA] Support conversion of list columns to strings
6 participants