-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add format API for list column of strings #9454
Add format API for list column of strings #9454
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #9454 +/- ##
================================================
- Coverage 10.79% 10.63% -0.16%
================================================
Files 116 117 +1
Lines 18869 19356 +487
================================================
+ Hits 2036 2059 +23
- Misses 16833 17297 +464
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMake changes LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving ops-codeowner
file changes
strings_column_view const& separators = strings_column_view(column_view{ | ||
data_type{type_id::STRING}, 0, nullptr}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to prefer a column rather than three separate scalars? Admittedly that adds more parameters to the API, but it seems awkward to stuff all three in a column (especially since I would anticipate that overloading the element separator would be a largely independent request from overriding the enclosures).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a little more efficient to create a column of strings which normally is a single device copy rather than individual scalars which would be 3 small device copies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, but it feels like a (minor) abuse of a column to stuff in three values that are semantically different but happen to be of the same type. The elements of a column seem like they should all "mean" the same thing, if that makes sense. This feels like a premature optimization, but H2D copies are expensive so maybe the improvement is worth it. I trust your judgment there, just felt odd to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few open questions, but not much clearly actionable and mostly just to satisfy my curiosity. I think we can ship it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
|
||
TEST_F(StringsFormatListsTest, EmptyList) | ||
{ | ||
using STR_LISTS = cudf::test::lists_column_wrapper<cudf::string_view>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This repeated line could be moved to the top.
@gpucibot merge |
Closes #8351
This PR adds API
cudf::strings::format_list_column
to create the formatted output as described in #8351. The API only accepts lists columns of strings.The format API takes parameters to specify the strings to use for
[
,]
and ',' as well as the string used to represent null entries.