
[FEA] support StringRepeat #68

Closed
revans2 opened this issue May 29, 2020 · 7 comments · Fixed by #2728
Labels: cudf_dependency (An issue or PR with this label depends on a new feature in cudf), feature request (New feature or request), SQL (part of the SQL/Dataframe plugin)

Comments

revans2 (Collaborator) commented May 29, 2020

Is your feature request related to a problem? Please describe.
We should support the repeat SQL function.

revans2 added the feature request, ? - Needs Triage (Need team to review and classify), and SQL labels on May 29, 2020
sameerz removed the ? - Needs Triage label on Sep 22, 2020
wjxiz1992 pushed a commit to wjxiz1992/spark-rapids that referenced this issue Oct 29, 2020
* gcp jupyternb, rapids,sh, notebook update with model overwrite

* notebook updates
revans2 added the cudf_dependency label on Feb 18, 2021
revans2 (Collaborator, Author) commented Feb 18, 2021

If we want to support a scalar value for times, then we can do it just by using concatenate. If we want to fully support this, we would need some help from cudf.
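For a scalar times, the concatenate approach amounts to something like this (a minimal sketch in plain Scala, just to pin down the semantics; the name is hypothetical and this is not the plugin's actual code):

```
// Illustrative only: repeat each row's string `times` times via concatenation,
// preserving nulls and returning "" for times <= 0 (Spark's repeat semantics).
def repeatViaConcat(rows: Seq[String], times: Int): Seq[String] =
  rows.map {
    case null => null
    case s    => Seq.fill(math.max(times, 0))(s).mkString
  }
```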

ttnghia self-assigned this on May 28, 2021
ttnghia (Collaborator) commented May 28, 2021

@revans2 Can you give some examples please? In particular:

  • For repeat(string_scalar, n), when string_scalar is invalid
  • For repeat(strings_column, n), when a string row is null
  • For repeat(strings_column, n), when a string row is a valid string but n is 0 or negative
  • Any other corner or non-trivial cases

In addition, we should take into account what happens when repeating a null. Should we always return a null, or an empty string? Or does that depend on the situation?

And one more thing: cudf now has the cudf::repeat API, which repeats each element multiple times (i.e., cudf::repeat(['a', 'b'], 2) = ['a', 'a', 'b', 'b']). We may want to use a name other than repeat to avoid a name conflict. I would suggest repeat_join or join_repeated (please choose one), since the existing strings APIs that do string concatenation always have join in their names.

revans2 (Collaborator, Author) commented Jun 1, 2021

Sure

scala> spark.sql("SELECT repeat('1_', 2) as test").show
+----+
|test|
+----+
|1_1_|
+----+


scala> spark.sql("SELECT repeat('1_', 0) as test").show
+----+
|test|
+----+
|    |
+----+


scala> spark.sql("SELECT repeat('1_', -1) as test").show
+----+
|test|
+----+
|    |
+----+


scala> spark.sql("SELECT repeat('1_', 50) as test").show
+--------------------+
|                test|
+--------------------+
|1_1_1_1_1_1_1_1_1...|
+--------------------+


scala> spark.sql("SELECT repeat('1_', null) as test").show
+----+
|test|
+----+
|null|
+----+


scala> spark.sql("SELECT repeat(null, null) as test").show
+----+
|test|
+----+
|null|
+----+


scala> spark.sql("SELECT repeat(null, 10) as test").show
+----+
|test|
+----+
|null|
+----+

From the Spark code: StringRepeat implements nullSafeEval, which means that if any of the inputs is null, the output will also be null. Spark does not care whether the value input is a scalar or a column; it treats them the same.

If the number of times it is repeated is <= 0, then we get an empty string back:

    // from Spark's UTF8String.repeat
    if (times <= 0) {
      return EMPTY_UTF8;
    }

The only other corner case I would call out is that the number of times something is repeated is an integer. A 32-bit value is also what we use for indexes into strings in cudf, so a single-character string with an Int.MaxValue repeat in theory works, but we could only support a single row in a batch, which is not good. As such, I don't know if it makes a lot of sense for cudf to support a string-repeat-like operator with that large a range. You might get pushback to only support shorts or something like that. Either way, we need to be careful, because this can become a memory management issue for us.

scala> spark.sql("SELECT repeat('1', 100000000) as test").show
21/06/01 14:23:02 WARN DAGScheduler: Broadcasting large task binary with size 95.8 MiB
+--------------------+
|                test|
+--------------------+
|11111111111111111...|
+--------------------+

sameerz added this to the June 7 - June 18 milestone on Jun 1, 2021
rapids-bot pushed a commit to rapidsai/cudf that referenced this issue on Jun 9, 2021:
This PR implements `strings::repeat_strings` which repeats the given string(s) multiple times. In contrast with the existing API `cudf::repeat` that repeats the rows (copies one row into multiple rows), this new API repeats the string within each row of the given strings column (copies the content of each string multiple times into the output string). For example:
```
strs = ['aa', null, '',  'bbc']
out  = repeat_strings(strs, 3)
out is ['aaaaaa', null, '',  'bbcbbcbbc']
```

This implements the cudf-side API for NVIDIA/spark-rapids#68.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - David Wendt (https://github.com/davidwendt)
  - Karthikeyan (https://github.com/karthikeyann)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #8423
revans2 (Collaborator, Author) commented Jun 9, 2021

The cudf implementation was merged in, but my request for proper bounds checking, so we don't overflow the string size limit and write over memory we should not, was disregarded. So if we are going to use this API, the Spark code is going to have to do the checks. For a scalar this should be fairly simple: we take the size of the data column and multiply. If the result is larger than the max size of a string column, then we need to blow up, at least until we find a way to have a project split batches as needed.
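A minimal sketch of that scalar-side check (hypothetical names; the real plugin code may differ):

```
// Hypothetical sketch: reject a scalar repeat that would overflow cudf's
// 32-bit string column size limit before calling into the native API.
def checkScalarRepeat(totalDataBytes: Long, times: Int): Unit = {
  val outBytes = totalDataBytes * math.max(times, 0).toLong
  require(outBytes <= Int.MaxValue.toLong,
    s"repeat would produce $outBytes bytes, over the max string column size")
}
```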

For a non-scalar repeat count, things get a bit harder, because we will have to get the length of each string in bytes, multiply it by the repeat count (computing the product as a long instead of an int), and then sum the values up before we do the check.
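And a sketch of that per-row version (again with hypothetical names, not the plugin's actual code):

```
// Hypothetical sketch: widen each per-row product to Long before summing,
// then compare the total output size against the 32-bit limit.
def checkColumnRepeat(byteLens: Array[Int], times: Array[Int]): Unit = {
  require(byteLens.length == times.length)
  val outBytes = byteLens.iterator.zip(times.iterator).map {
    case (len, n) => len.toLong * math.max(n, 0).toLong
  }.sum
  require(outBytes <= Int.MaxValue.toLong,
    s"repeat would produce $outBytes bytes, over the max string column size")
}
```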

ttnghia (Collaborator) commented Jun 9, 2021

Thanks for reminding me about this. A bounds check for a strings column is also very simple, but we have to read from device memory twice with a stream sync; that is why the request was disregarded. We just need to read from device memory the first offset (because the strings column can be a sliced column) and the last offset; the total (current) length is then just (last_offset - first_offset). The bounds check can be done by checking that (last_offset - first_offset) <= INT_MAX / repeat_times.
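A minimal sketch of that offsets-based check (hypothetical names; the real check would sit on the C++/JNI side):

```
// Sketch: a sliced strings column's total byte length is lastOffset - firstOffset;
// the repeated output fits in a cudf column iff that length <= Int.MaxValue / times.
def fitsAfterRepeat(firstOffset: Int, lastOffset: Int, times: Int): Boolean =
  times <= 0 || (lastOffset.toLong - firstOffset.toLong) <= (Int.MaxValue / times).toLong
```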

One question: should we do the bounds check at the JNI layer or in Spark's plugin layer?

revans2 (Collaborator, Author) commented Jun 9, 2021

I would have Spark do it. The JNI API is supposed to be a thin layer around the underlying library; we should not be adding in things that cudf does not want. Also, this is technically a problem for a lot of other string APIs. String concat has the same kind of problem; it is just going to be a bit harder to hit it there than it is with repeat.

sameerz (Collaborator) commented Jul 20, 2021

Moving to 21.10, as the JNI-related PR is in review and we are in burndown for 21.08.

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
Signed-off-by: spark-rapids automation <[email protected]>