[FEA] support StringRepeat #68
If we want to support a scalar value for …
@revans2 Can you give some examples, please? In particular: …
In addition, we should take into account what happens when repeating a null. Should we always return a null, or an empty string? Or does that depend on the situation? And one more thing: cudf now has the …
Sure
From the Spark code for …: if the number of times the string is repeated is <= 0, then we get an empty string back.
The only other corner case that I would call out is that the number of …
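A minimal CPU reference for these semantics, paraphrased from the description above rather than taken from the actual Spark source (the null behavior is an assumption, since the question above was left open):

```scala
// Hedged sketch of the repeat semantics described above: a count <= 0
// yields an empty string, and (presumably) a null string or null count
// yields null. A reference model, not Spark's implementation.
def repeatRef(s: String, n: java.lang.Integer): String = {
  if (s == null || n == null) null // nulls propagate (assumed)
  else if (n <= 0) ""              // non-positive count -> empty string
  else s * n                       // Scala StringOps repetition
}
```

For example, `repeatRef("bbc", 3)` returns `"bbcbbcbbc"`, and `repeatRef("bbc", -1)` returns `""`.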
This PR implements `strings::repeat_strings`, which repeats the given string(s) multiple times. In contrast with the existing API `cudf::repeat`, which repeats rows (copies one row into multiple rows), this new API repeats the string within each row of the given strings column (copies the content of each string multiple times into the output string). For example:

```
strs = ['aa', null, '', 'bbc']
out  = repeat_strings(strs, 3)
out is ['aaaaaa', null, '', 'bbcbbcbbc']
```

This implements the cudf-side API for NVIDIA/spark-rapids#68.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- David Wendt (https://github.com/davidwendt)
- Karthikeyan (https://github.com/karthikeyann)
- Robert Maynard (https://github.com/robertmaynard)

URL: #8423
The cudf implementation was merged in, but my request for proper bounds checking, so we don't overflow the string size limit and write over memory we should not, was disregarded. So if we are going to use this API, the Spark code is going to have to do the checks. For a scalar repeat count this should be fairly simple: take the total size of the string data in the column and multiply by the count. If the result is larger than the max size of a string column, then we need to blow up, at least until we find a way to have a project split batches as needed. For a non-scalar repeat count things get a bit harder, because we will have to get the length of each string in bytes, multiply it by that row's repeat count (computing the product as a long instead of an int), and then sum the values up before we do the check.
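A sketch of what such a plugin-side check could look like, in Scala. The names (`RepeatStringsBounds`, `checkScalarRepeat`, `checkColumnRepeat`) are hypothetical, and the total/per-string byte lengths are assumed to have been fetched already; this is just the arithmetic described above, widened to 64 bits so the check itself cannot overflow:

```scala
// Hypothetical bounds checks before calling repeat_strings, per the
// comment above. Not part of cudf or the plugin; names are made up.
object RepeatStringsBounds {
  // cudf strings columns use 32-bit offsets, so output must fit in an int.
  val MaxStringColumnBytes: Long = Int.MaxValue.toLong

  // Scalar repeat count: total output size is (total input bytes) * count.
  def checkScalarRepeat(totalInputBytes: Long, repeatTimes: Int): Unit = {
    // A count <= 0 produces empty strings, which always fit.
    val outputBytes = if (repeatTimes > 0) totalInputBytes * repeatTimes else 0L
    require(outputBytes <= MaxStringColumnBytes,
      s"repeat would produce $outputBytes bytes, over the string column limit")
  }

  // Column repeat count: sum per-row (length in bytes) * (repeat count),
  // with each product widened to a long before summing.
  def checkColumnRepeat(byteLengths: Array[Int], repeatTimes: Array[Int]): Unit = {
    val outputBytes = byteLengths.iterator.zip(repeatTimes.iterator)
      .map { case (len, n) => if (n > 0) len.toLong * n else 0L }
      .sum
    require(outputBytes <= MaxStringColumnBytes,
      s"repeat would produce $outputBytes bytes, over the string column limit")
  }
}
```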
Thanks for reminding me about this. The bounds check for a strings column is also very simple, but we have to read from device memory twice, each with a stream sync; that is why the request was disregarded. We just need to read the first offset from device memory (because the strings column can be a sliced column) and the last offset; the total (current) length is then just `last_offset - first_offset`.

One question: should we do the bounds check at the JNI layer or in Spark's plugin layer?
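For illustration, a sketch of that length computation, assuming a hypothetical `readOffset` helper that copies a single 32-bit offset from device memory (each call implying one of the two stream syncs mentioned above):

```scala
// Total byte length of a (possibly sliced) strings column, per the comment
// above. `readOffset` is a hypothetical device-to-host read of one offset.
def totalStringBytes(readOffset: Int => Int, numRows: Int): Long = {
  val first = readOffset(0)       // not necessarily 0 for a sliced column
  val last = readOffset(numRows)  // offsets array has numRows + 1 entries
  last.toLong - first.toLong
}
```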
I would have Spark do it. The JNI API is supposed to be a thin layer around the underlying library; we should not be adding in things that cudf does not want. Also, this is technically a problem for a lot of other string APIs: string concat has the same kind of problem, it is just going to be a bit harder to hit there than it is with repeat.
Moving to 21.10, as the JNI-related PR is in review and we are in burndown for 21.08.
Is your feature request related to a problem? Please describe.
We should support the `repeat` SQL function.
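For reference, this is the existing Spark SQL built-in the plugin would accelerate; on CPU Spark today it behaves as below:

```scala
// Spark SQL's built-in repeat(str, n); the request is to run it on the GPU.
import org.apache.spark.sql.SparkSession

object RepeatDemo extends App {
  val spark = SparkSession.builder()
    .appName("repeat-demo")
    .master("local[*]")
    .getOrCreate()
  spark.sql("SELECT repeat('ab', 3) AS r").show() // prints: ababab
  spark.stop()
}
```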