Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement strings::repeat_strings #8423

Merged
merged 30 commits into from
Jun 9, 2021

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Jun 1, 2021

This PR implements strings::repeat_strings which repeats the given string(s) multiple times. In contrast with the existing API cudf::repeat that repeats the rows (copies one row into multiple rows), this new API repeats the string within each row of the given strings column (copies the content of each string multiple times into the output string). For example:

strs = ['aa', null, '',  'bbc']
out  = repeat_strings(strs, 3)
out is ['aaaaaa', null, '',  'bbcbbcbbc']

This implements cudf side API for NVIDIA/spark-rapids#68.

@ttnghia ttnghia added feature request New feature or request 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python) non-breaking Non-breaking change labels Jun 1, 2021
@ttnghia ttnghia self-assigned this Jun 1, 2021
@ttnghia ttnghia requested review from a team as code owners June 1, 2021 22:09
@ttnghia ttnghia requested review from mythrocks and karthikeyann June 1, 2021 22:09
@github-actions github-actions bot added CMake CMake build issue conda labels Jun 1, 2021
@harrism
Copy link
Member

harrism commented Jun 1, 2021

I think the name repeat_join is too ambiguous, because in libcudf, we use join to mean a very different thing: database join, meaning matching columns across different tables.

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 1, 2021

Yes, it was very difficult to choose a name for this API. You are welcomed to suggest a better name for this.

Basically, this API should be named repeat as in Spark. However, we already have cudf::repeat which just does repeating rows (for example, repeat(['a', 'bc'], 2) = ['a', 'a', 'bc', 'bc']. Since in strings namespace we have various join APIs that do similar tasks to this one (such as join_strings('a', 'bc') = 'abc'), I decided to call the new API repeat_join, which means "repeat, then join". I still feel this is not the best name, but I don't have a perfect name yet (another name would be join_repeated but I'm not sure if it's better than this one).

@harrism
Copy link
Member

harrism commented Jun 1, 2021

Strings are the only supported type, right? cudf::repeat_strings

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 1, 2021

Strings are the only supported type, right? cudf::repeat_strings

Yes, this API only supports strings. Should be cudf::strings::, and yes, repeat_strings is definitely better. Thanks Mark!!!

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good.

cpp/src/strings/repeat_strings.cu Outdated Show resolved Hide resolved
cpp/src/strings/repeat_strings.cu Outdated Show resolved Hide resolved
cpp/src/strings/repeat_strings.cu Outdated Show resolved Hide resolved
cpp/src/strings/repeat_strings.cu Outdated Show resolved Hide resolved
cpp/include/cudf/strings/repeat_strings.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/repeat_strings.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/repeat_strings.hpp Outdated Show resolved Hide resolved
cpp/src/strings/repeat_strings.cu Outdated Show resolved Hide resolved
cpp/src/strings/repeat_strings.cu Outdated Show resolved Hide resolved
cpp/src/strings/repeat_strings.cu Outdated Show resolved Hide resolved
@ttnghia ttnghia requested review from davidwendt and harrism June 7, 2021 15:11
cpp/src/strings/repeat_strings.cu Outdated Show resolved Hide resolved
cpp/include/cudf/strings/repeat_strings.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/repeat_strings.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/repeat_strings.hpp Outdated Show resolved Hide resolved
cpp/src/strings/repeat_strings.cu Show resolved Hide resolved
@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 9, 2021

@gpucibot merge

@ttnghia ttnghia requested a review from robertmaynard June 9, 2021 12:17
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to CMake changes

@rapids-bot rapids-bot bot merged commit 8df3c6a into rapidsai:branch-21.08 Jun 9, 2021
@ttnghia ttnghia deleted the strings_repeat branch June 17, 2021 13:23
rapids-bot bot pushed a commit that referenced this pull request Jul 20, 2021
…rent number of times (#8561)

This work is requested from the Spark team, which is also a follow up work on #8423 so that cudf's `strings::repeat_strings` fully supports `StringRepeat` SQL expression in Apache Spark.

Note that this API requires to explicitly implement overflow check for the size of the output strings column, as it is not trivial and can't be performed outside of cudf.

This PR also rewrites some existing code, including renaming variables and changes in doxygen.

### Follow up works depending on this PR:
 * Benchmark: #8589
 * Java binding: #8572

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Jason Lowe (https://github.com/jlowe)
  - Conor Hoekstra (https://github.com/codereport)
  - David Wendt (https://github.com/davidwendt)

URL: #8561
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants