-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow casting from UDFString
back to StringView
to call methods in strings_udf
#12363
Allow casting from UDFString
back to StringView
to call methods in strings_udf
#12363
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## branch-23.04 #12363 +/- ##
===============================================
Coverage ? 85.83%
===============================================
Files ? 159
Lines ? 25196
Branches ? 0
===============================================
Hits ? 21627
Misses ? 3569
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of minor requests, but looks fine overall.
|
||
|
||
def sv_to_udf_str(sv): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this function is already defined here so I'm not sure why it's also needed in the main codebase. Maybe that can just be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also confused here about the requirement/use of this function, and the lowering. It seems (from the tests) like it must be user-facing functionality: since the string-udf is written by an external entity, how did they previously construct UDF strings (rather than string_view strings)? And so, if converting between the two is necessary, surely this should be a function in the toolkit of UDF authors.
If not, how is that UDF authors can write something that gets passed a string view but expects a UDF string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great question. It happens as follows. All strings in strings_udf
start life as a string_view
, because the input column isn't being mutated. column_to_string_view_array
handles this. It stays that way all the way until it hits the users UDF. Imagine they have something like this:
def f(some_string: string_view):
return some_string.startswith('a')
Following the typing in this case, some_string
is known to be of type string_view
. When startswith
is called, numba searches for a declaration of StringView.startswith(StringView)
and finds the one that returns a bool
. The lowering to actually do the work is inserted there and everything is fine.
Now, suppose the user has something like
def f(some_string: string_view):
new_string = some_string.upper()
return new_string.startswith('a')
Again some_string
starts life as an instance of StringView
, but that's where the similarities end. In this case, numba needs to figure out what type new_string
is. Since functions that return strings need to be built around a udf_string
, that's what upper()
does. Numba will find the overload of StringView.upper() -> udf_string
and determine that the type of new_string
is now udf_string
.
This is where we hit the problem because there is no method UDFString.startswith
. In the c++ layer there isn't even a startswith
method associated with udf_string
objects because they're considered to be a read-only operation, just like there's no upper()
method of cudf::string_view
.
What this PR is trying to do is turn one into the other when necessary. When a string_view
tries to call a udf_string
method, it needs to be cast, and the same way the other way around. So while users don't ever explicitly do any work with udf_string
or string_view
themselves, depending on what they're doing in their function their string could be either of them at any time and still needs to always have access to both sets of methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is a nice explanation and clarifies things a lot. Can we put it in some documentation somewhere?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brandon-b-miller could still use this documentation (unless I missed it somewhere during my review).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I am confused here about the need for this code, and particularly which (if any) parts need to be exposed to authors of UDFs. It seems like none, but then how did we previously get in a situation where string views needed to be converted to udf strings (or vice versa)?
id = cuda.grid(1) | ||
if id < size: | ||
st = input_strings[id] | ||
st = sv_to_udf_str(st) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of don't understand how this could work, this sv_to_udf_str
is imported from tests.utils
, but most of the registration that happens in this PR works on sv_to_udf_str
defined in _typing
. How are these matched up? I hope very much that numba doesn't just use the unqualified string name of the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should only be one of these now and should never have been two (likely a copy paste error). I'm not sure how numba resolves two functions that have the same name, I don't think it's by the name, my guess is that since both copies of it didn't do anything except pass
that their bytecode or whatever numba uses to do its lookup was similar enough that it worked.
In any case, there's only one now and it's defined in _testing
.
|
||
|
||
def sv_to_udf_str(sv): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also confused here about the requirement/use of this function, and the lowering. It seems (from the tests) like it must be user-facing functionality: since the string-udf is written by an external entity, how did they previously construct UDF strings (rather than string_view strings)? And so, if converting between the two is necessary, surely this should be a function in the toolkit of UDF authors.
If not, how is that UDF authors can write something that gets passed a string view but expects a UDF string?
Sorry all for the noise from the branch change of base here! |
This has been updated with the changes from #12669 and should be ready to go again. The code is largely the same as when we last looked at it, the difference mainly being the shuffling around of files. |
bool_binary_funcs = ["startswith", "endswith"] | ||
int_binary_funcs = ["find", "rfind"] | ||
id_unary_funcs = [ | ||
"isalpha", | ||
"isalnum", | ||
"isdecimal", | ||
"isdigit", | ||
"isupper", | ||
"islower", | ||
"isspace", | ||
"isnumeric", | ||
"istitle", | ||
] | ||
string_unary_funcs = ["upper", "lower"] | ||
string_return_attrs = ["strip", "lstrip", "rstrip"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused, these don't seem to be used at all anymore. A few questions:
- Can these be deleted now?
- Was there a reason to switch from the old monkey-patching approach to the new approach other than clarity, i.e. does it impact whether things work for both UDFString and StringView instead of just one of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up reverting this because it was mostly a refactor I forgot to undo. I'll clean this up in a separate PR.
|
||
|
||
def get_kernel(func, dtype, size): | ||
def get_kernels(func, dtype, size): | ||
""" | ||
Create a kernel for testing a single scalar string function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring needs updating to reflect that you create two separate kernels now, one for each type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
|
||
|
||
def sv_to_udf_str(sv): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brandon-b-miller could still use this documentation (unless I missed it somewhere during my review).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, the new docstring is quite helpful!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/merge
/merge |
This PR adds some code to cast a
UDFString
to aStringView
which unblocks UDFs that end up calling further transformations on strings that have already been returned by other functions. It works by registering a set of attributes toUDFString
instances that mirror the ones attached toStringView
, and introducing lowering that allows a cast. The cast ultimately calls a shim function which wraps thecudf::string_view
casting operator ofudf_string
.