
[FEA] support StringRepeat #68

Closed
revans2 opened this issue May 29, 2020 · 7 comments · Fixed by #2728
Labels: cudf_dependency (An issue or PR with this label depends on a new feature in cudf), feature request (New feature or request), SQL (part of the SQL/Dataframe plugin)

Comments

revans2 (Collaborator) commented May 29, 2020

Is your feature request related to a problem? Please describe.
We should support the repeat SQL function.

revans2 added the feature request, ? - Needs Triage (Need team to review and classify), and SQL labels on May 29, 2020
sameerz removed the ? - Needs Triage label on Sep 22, 2020
wjxiz1992 pushed a commit to wjxiz1992/spark-rapids that referenced this issue Oct 29, 2020
* gcp jupyternb, rapids,sh, notebook update with model overwrite

* notebook updates
revans2 added the cudf_dependency label on Feb 18, 2021
revans2 (Collaborator, Author) commented Feb 18, 2021

If we want to support a scalar value for times, then we can do it just by using concatenate. If we want to fully support this, we would need some help from cudf.
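For a scalar times, the concatenate approach amounts to something like this (a minimal sketch in plain Scala, just to pin down the semantics; the name is hypothetical and this is not the plugin's actual code):

```
// Illustrative only: repeat each row's string `times` times via concatenation,
// preserving nulls and returning "" for times <= 0 (Spark's repeat semantics).
def repeatViaConcat(rows: Seq[String], times: Int): Seq[String] =
  rows.map {
    case null => null
    case s    => Seq.fill(math.max(times, 0))(s).mkString
  }
```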

ttnghia self-assigned this on May 28, 2021
ttnghia (Collaborator) commented May 28, 2021

@revans2 Can you give some examples please? In particular:

  • For repeat(string_scalar, n), when string_scalar is invalid
  • For repeat(strings_column, n), when a string row is null
  • For repeat(strings_column, n), when a string row is a valid string but n is 0 or negative
  • Any other corner or non-trivial cases

In addition, we should take into account what happens when repeating a null. Should we always return a null, or an empty string? Or does that depend on the situation?

And one more thing: cudf now has the cudf::repeat API, which repeats each element multiple times (i.e., cudf::repeat(['a', 'b'], 2) = ['a', 'a', 'b', 'b']). We may want to use a name other than repeat to avoid a name conflict. I would suggest repeat_join or join_repeated (please choose one), since the existing strings APIs that do string concatenation always have join in their names.

revans2 (Collaborator, Author) commented Jun 1, 2021

Sure

scala> spark.sql("SELECT repeat('1_', 2) as test").show
+----+
|test|
+----+
|1_1_|
+----+


scala> spark.sql("SELECT repeat('1_', 0) as test").show
+----+
|test|
+----+
|    |
+----+


scala> spark.sql("SELECT repeat('1_', -1) as test").show
+----+
|test|
+----+
|    |
+----+


scala> spark.sql("SELECT repeat('1_', 50) as test").show
+--------------------+
|                test|
+--------------------+
|1_1_1_1_1_1_1_1_1...|
+--------------------+


scala> spark.sql("SELECT repeat('1_', null) as test").show
+----+
|test|
+----+
|null|
+----+


scala> spark.sql("SELECT repeat(null, null) as test").show
+----+
|test|
+----+
|null|
+----+


scala> spark.sql("SELECT repeat(null, 10) as test").show
+----+
|test|
+----+
|null|
+----+

From the Spark code: StringRepeat implements nullSafeEval, which means that if any of the inputs is null, the output will also be null. Spark does not care whether the value input is a scalar or a column; it treats them the same.

If the number of times it is repeated is <= 0, then we get an empty string back:

    // from Spark's UTF8String.repeat
    if (times <= 0) {
      return EMPTY_UTF8;
    }

The only other corner case I would call out is that the number of times something is repeated is an integer. A 32-bit value is also what we use for indexes into strings in cudf, so a single-character string with an Int.MaxValue repeat in theory works, but we could only support a single row in a batch, which is not good. As such, I don't know if it makes a lot of sense for cudf to support a string-repeat-like operator with that large a range. You might get pushback to only support shorts or something like that. Either way, we need to be careful, because this can become a memory management issue for us.

scala> spark.sql("SELECT repeat('1', 100000000) as test").show
21/06/01 14:23:02 WARN DAGScheduler: Broadcasting large task binary with size 95.8 MiB
+--------------------+
|                test|
+--------------------+
|11111111111111111...|
+--------------------+

sameerz added this to the June 7 - June 18 milestone on Jun 1, 2021
rapids-bot pushed a commit to rapidsai/cudf that referenced this issue on Jun 9, 2021:
This PR implements `strings::repeat_strings` which repeats the given string(s) multiple times. In contrast with the existing API `cudf::repeat` that repeats the rows (copies one row into multiple rows), this new API repeats the string within each row of the given strings column (copies the content of each string multiple times into the output string). For example:
```
strs = ['aa', null, '',  'bbc']
out  = repeat_strings(strs, 3)
out is ['aaaaaa', null, '',  'bbcbbcbbc']
```

This implements the cudf-side API for NVIDIA/spark-rapids#68.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - David Wendt (https://github.com/davidwendt)
  - Karthikeyan (https://github.com/karthikeyann)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #8423
revans2 (Collaborator, Author) commented Jun 9, 2021

The cudf implementation was merged in, but my request for proper bounds checking, so we don't overflow the string size limit and write over memory we should not, was disregarded. So if we are going to use this API, the Spark code is going to have to do the checks. For a scalar this should be fairly simple: we take the size of the data column and multiply. If the result is larger than the max size of a string column, then we need to blow up, at least until we find a way to have a project split batches as needed.
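A minimal sketch of that scalar-side check (hypothetical names; the real plugin code may differ):

```
// Hypothetical sketch: reject a scalar repeat that would overflow cudf's
// 32-bit string column size limit before calling into the native API.
def checkScalarRepeat(totalDataBytes: Long, times: Int): Unit = {
  val outBytes = totalDataBytes * math.max(times, 0).toLong
  require(outBytes <= Int.MaxValue.toLong,
    s"repeat would produce $outBytes bytes, over the max string column size")
}
```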

For a non-scalar repeat count, things get a bit harder, because we will have to get the length of each string in bytes, multiply it by the repeat count (computing the product as a long instead of an int), and then sum the values up before we do the check.
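And a sketch of that per-row version (again with hypothetical names, not the plugin's actual code):

```
// Hypothetical sketch: widen each per-row product to Long before summing,
// then compare the total output size against the 32-bit limit.
def checkColumnRepeat(byteLens: Array[Int], times: Array[Int]): Unit = {
  require(byteLens.length == times.length)
  val outBytes = byteLens.iterator.zip(times.iterator).map {
    case (len, n) => len.toLong * math.max(n, 0).toLong
  }.sum
  require(outBytes <= Int.MaxValue.toLong,
    s"repeat would produce $outBytes bytes, over the max string column size")
}
```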

ttnghia (Collaborator) commented Jun 9, 2021

Thanks for reminding me about this. A bounds check for a strings column is also very simple, but we have to read from device memory twice with a stream sync; that is why the request was disregarded. We just need to read from device memory the first offset (because the strings column can be a sliced column) and the last offset; the total (current) length is then just (last_offset - first_offset). The bounds check can be done by checking that (last_offset - first_offset) <= INT_MAX / repeat_times.
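A minimal sketch of that offsets-based check (hypothetical names; the real check would sit on the C++/JNI side):

```
// Sketch: a sliced strings column's total byte length is lastOffset - firstOffset;
// the repeated output fits in a cudf column iff that length <= Int.MaxValue / times.
def fitsAfterRepeat(firstOffset: Int, lastOffset: Int, times: Int): Boolean =
  times <= 0 || (lastOffset.toLong - firstOffset.toLong) <= (Int.MaxValue / times).toLong
```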

One question: should we do the bounds check at the JNI layer or in Spark's plugin layer?

revans2 (Collaborator, Author) commented Jun 9, 2021

I would have Spark do it. The JNI API is supposed to be a thin layer around the underlying library; we should not be adding in things that cudf does not want. Also, this is technically a problem for a lot of other string APIs. String concat has the same kind of problem; it is just going to be a bit harder to hit it there than it is with repeat.

sameerz (Collaborator) commented Jul 20, 2021

Moving to 21.10, as the JNI-related PR is in review and we are in burndown for 21.08.

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
Signed-off-by: spark-rapids automation <[email protected]>