[REVIEW] compute substrings from beginning until delimiter or from a delimiter until end of string #5303
[RELEASE] cudf v0.12
…er until end of string - this Closes rapidsai#5158 - this emulates spark's `substring_index` function
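For reviewers unfamiliar with Spark's `substring_index`, the behaviour this PR emulates can be sketched on the host roughly as follows. This is only a plain-Python reference for the semantics, not the cudf implementation. The helper is illustrative, and its handling of an empty delimiter and of overlapping multi-character delimiters is an assumption rather than something stated in this thread.

```python
def substring_index(s: str, delim: str, count: int) -> str:
    """Host-side reference sketch of Spark-style substring_index semantics.

    count > 0: everything before the count-th delimiter, counting from the left.
    count < 0: everything after the |count|-th delimiter, counting from the right.
    If the delimiter occurs fewer than |count| times, the whole string is returned.
    """
    # Assumption: a zero count or an empty delimiter yields an empty string.
    if count == 0 or delim == "":
        return ""

    if count > 0:
        pos = -1
        for _ in range(count):
            # Resume one character past the previous match start.
            pos = s.find(delim, pos + 1)
            if pos < 0:
                return s  # not enough delimiters: return the input unchanged
        return s[:pos]

    # Negative count: search backwards from the end of the string.
    pos = len(s) - len(delim) + 1
    for _ in range(-count):
        # Only accept matches that start strictly before the previous match.
        pos = s.rfind(delim, 0, pos - 1 + len(delim))
        if pos < 0:
            return s
    return s[pos + len(delim):]


if __name__ == "__main__":
    print(substring_index("www.apache.org", ".", 2))   # www.apache
    print(substring_index("www.apache.org", ".", -2))  # apache.org
```

A positive count keeps the left-hand side and a negative count keeps the right-hand side, which matches the "from beginning until delimiter or from a delimiter until end of string" wording in the PR title.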
Please update the changelog in order to start CI tests. View the gpuCI docs here.
This should not be against branch-0.14.
Seems odd that |
[sc] do all new features from now on go against 0.15? i'm asking because i have been creating them against 0.14 thus far. is there a cutoff date for 0.14 (after which nothing other than bug fixes gets in)?
[sc] i have reused the function name from spark. should i rename this to |
https://docs.rapids.ai/releases/process/
Here are the current dates: https://docs.rapids.ai/maintainers
Once burndown starts, we generally don't accept new PRs unless they are urgent. You just need to retarget this at 0.15 (click the edit button next to the PR title).
This is very similar to cudf::strings::split
You may be able to use
Example:
The Looking at the overall behavior, I think overloading |
thanks for the references to the fwiw, i did a small test to create a million strings (from the last test in this pr) and forward searched for a string scalar using both the we could add more flags to ignore creating those additional columns if needed. but, wouldn't this clutter the api? |
Is it slower because it is building 2 extra columns? Or perhaps
Just seems there is a possibility of code re-use with some existing APIs. Also, I guess I'm having trouble with the name. There is There is also a set of |
[sc] i did not profile it, but my suspicion was also more along the lines of building those extra columns for this use-case.
[sc] i wasn't happy with it either and simply reused what spark had in the absence of a better name to elicit discussions
[sc] i'll look @ the slice api's to see if i can reuse some of its implementation. would renaming these api's also as
- reuse some of the facility `slice_strings` already has to build the substrings
This looks great.
rerun tests
Great work! The test coverage looks perfect :)
Some (mostly minor) suggestions
…to substring_index
substring_index
function