Allow casting from UDFString back to StringView to call methods in strings_udf #12363

Merged

Conversation

brandon-b-miller
Contributor

This PR adds code to cast a UDFString to a StringView, which unblocks UDFs that call further transformations on strings already returned by other functions. It works by registering a set of attributes on UDFString instances that mirror the ones attached to StringView, and by introducing lowering that allows the cast. The cast ultimately calls a shim function that wraps the cudf::string_view conversion operator of udf_string.
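The mechanism described above can be pictured with a small plain-Python model. This is only a sketch: ordinary classes stand in for the Numba types, `to_string_view` and `mirror_view_methods` are hypothetical names, and nothing here uses Numba, cuDF, or the real shim function.

```python
# Toy model only: plain Python classes stand in for the Numba types
# StringView and UDFString. No Numba or cuDF code is involved.


class StringView:
    """Read-only view; carries the query-style methods."""

    def __init__(self, data: str):
        self._data = data

    def startswith(self, prefix: str) -> bool:
        return self._data.startswith(prefix)

    def find(self, sub: str) -> int:
        return self._data.find(sub)


class UDFString:
    """Owning string, as produced by string-returning operations."""

    def __init__(self, data: str):
        self._data = data

    def to_string_view(self) -> StringView:
        # Hypothetical stand-in for the shim that wraps the
        # udf_string -> cudf::string_view conversion.
        return StringView(self._data)


def mirror_view_methods(names):
    """Give UDFString each named StringView method by casting first."""
    for name in names:
        # Bind the name as a default so each method keeps its own value.
        def method(self, *args, _name=name):
            return getattr(self.to_string_view(), _name)(*args)

        setattr(UDFString, name, method)


mirror_view_methods(["startswith", "find"])

owned = UDFString("abc")
print(owned.startswith("a"))  # True
print(owned.find("c"))        # 2
```

In the real PR the mirroring happens at Numba's typing and lowering level rather than through `setattr`, so this model only conveys the shape of the idea.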

@brandon-b-miller brandon-b-miller added the bug, numba, Python, and non-breaking labels Dec 12, 2022
@brandon-b-miller brandon-b-miller self-assigned this Dec 12, 2022
@codecov

codecov bot commented Dec 12, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.04@ea62e0e). Click here to learn what that means.
Patch has no changes to coverable lines.

❗ Current head 0d01928 differs from pull request most recent head e6ae995. Consider uploading reports for the commit e6ae995 to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-23.04   #12363   +/-   ##
===============================================
  Coverage                ?   85.83%           
===============================================
  Files                   ?      159           
  Lines                   ?    25196           
  Branches                ?        0           
===============================================
  Hits                    ?    21627           
  Misses                  ?     3569           
  Partials                ?        0           


@brandon-b-miller brandon-b-miller marked this pull request as ready for review January 5, 2023 14:53
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner January 5, 2023 14:53
Contributor

@vyasr vyasr left a comment

Couple of minor requests, but looks fine overall.

python/strings_udf/strings_udf/_typing.py (outdated, resolved)
python/cudf/cudf/core/udf/strings_lowering.py (resolved)
python/strings_udf/strings_udf/lowering.py (outdated, resolved)
python/strings_udf/strings_udf/lowering.py (outdated, resolved)


def sv_to_udf_str(sv):
    pass
Contributor

It looks like this function is already defined here so I'm not sure why it's also needed in the main codebase. Maybe that can just be removed?

Contributor

I am also confused here about the requirement/use of this function, and the lowering. It seems (from the tests) like it must be user-facing functionality: since the string-udf is written by an external entity, how did they previously construct UDF strings (rather than string_view strings)? And so, if converting between the two is necessary, surely this should be a function in the toolkit of UDF authors.

If not, how is it that UDF authors can write something that gets passed a string view but expects a UDF string?

Contributor Author

@brandon-b-miller brandon-b-miller Jan 30, 2023

This is a great question. It happens as follows. All strings in strings_udf start life as a string_view, because the input column isn't being mutated; column_to_string_view_array handles this. Each string stays that way until it reaches the user's UDF. Imagine they have something like this:

def f(some_string: string_view):
    return some_string.startswith('a')

Following the typing in this case, some_string is known to be of type string_view. When startswith is called, numba searches for a declaration of StringView.startswith(StringView) and finds the one that returns a bool. The lowering to actually do the work is inserted there and everything is fine.

Now, suppose the user has something like

def f(some_string: string_view):
    new_string = some_string.upper()
    return new_string.startswith('a')

Again some_string starts life as an instance of StringView, but that's where the similarities end. In this case, numba needs to figure out what type new_string is. Since functions that return strings need to be built around a udf_string, that's what upper() does. Numba will find the overload of StringView.upper() -> udf_string and determine that the type of new_string is now udf_string.

This is where we hit the problem, because there is no method UDFString.startswith. In the C++ layer there isn't even a startswith method associated with udf_string objects, because startswith is considered a read-only operation, just as there's no upper() method on cudf::string_view.

What this PR is trying to do is turn one into the other when necessary. When a string_view tries to call a udf_string method, it needs to be cast, and vice versa. So while users never explicitly work with udf_string or string_view themselves, their string could be either type at any point depending on what their function does, and it always needs access to both sets of methods.
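The typing walk-through above can be sketched as a toy type-inference model. The declaration table, the cast table, and the helper names `resolve` and `infer_chain` are all hypothetical; they only mirror the methods named in this comment and are not Numba's actual data structures.

```python
# Toy type-inference model. Table entries and helper names are
# hypothetical; they mirror only the methods named in the discussion.

# (receiver type, method name) -> return type
DECLARATIONS = {
    ("StringView", "startswith"): "bool",
    ("StringView", "upper"): "UDFString",
}

# Implicit casts a receiver may take when a method is missing.
CASTS = {"UDFString": "StringView"}


def resolve(receiver, method, allow_cast):
    """Return the result type of receiver.method, or raise TypeError."""
    if (receiver, method) in DECLARATIONS:
        return DECLARATIONS[(receiver, method)]
    target = CASTS.get(receiver) if allow_cast else None
    if target is not None and (target, method) in DECLARATIONS:
        return DECLARATIONS[(target, method)]
    raise TypeError(f"no declaration for {receiver}.{method}")


def infer_chain(start, methods, allow_cast):
    """Follow a chain of method calls and return the final type."""
    current = start
    for method in methods:
        current = resolve(current, method, allow_cast)
    return current


# some_string.startswith('a') resolves directly.
print(infer_chain("StringView", ["startswith"], allow_cast=False))

# some_string.upper().startswith('a') resolves only once the cast exists.
try:
    infer_chain("StringView", ["upper", "startswith"], allow_cast=False)
except TypeError as err:
    print("without cast:", err)

print(infer_chain("StringView", ["upper", "startswith"], allow_cast=True))
```

The second call fails for the same reason the chained UDF failed before this PR: the intermediate type has no `startswith` declaration of its own.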

Contributor

Thanks, this is a nice explanation and clarifies things a lot. Can we put it in some documentation somewhere?!

Contributor

@brandon-b-miller could still use this documentation (unless I missed it somewhere during my review).

Contributor

@wence- wence- left a comment

I think I am confused here about the need for this code, and particularly which (if any) parts need to be exposed to authors of UDFs. It seems like none, but then how did we previously get in a situation where string views needed to be converted to udf strings (or vice versa)?

python/strings_udf/strings_udf/tests/test_string_udfs.py (outdated, resolved)
id = cuda.grid(1)
if id < size:
    st = input_strings[id]
    st = sv_to_udf_str(st)
Contributor

I don't quite understand how this could work. This sv_to_udf_str is imported from tests.utils, but most of the registration in this PR works on the sv_to_udf_str defined in _typing. How are these matched up? I very much hope that numba doesn't just use the unqualified string name of the function.

Contributor Author

There should only be one of these now, and there should never have been two (likely a copy-paste error). I'm not sure how numba resolves two functions that have the same name. I don't think it's by name; my guess is that since both copies did nothing except pass, their bytecode (or whatever numba uses for its lookup) was similar enough that it worked.

In any case, there's only one now and it's defined in _testing.
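For background on the name-versus-identity question, here is how plain Python treats two stubs that share a name and bytecode. It says nothing about how Numba actually matched the two copies in this PR, which the thread leaves open; `make_stub` and `registry` are hypothetical names used only for the illustration.

```python
# Illustration of general Python behaviour only. A dict keyed by
# function objects distinguishes two stubs even when their names and
# bytecode are identical.


def make_stub():
    def sv_to_udf_str(sv):
        pass

    return sv_to_udf_str


first = make_stub()
second = make_stub()

# Hypothetical registry keyed by the function object itself.
registry = {first: "lowering registered for the first stub"}

print(first.__name__ == second.__name__)                  # True
print(first.__code__.co_code == second.__code__.co_code)  # True
print(first is second)                                    # False
print(second in registry)                                 # False
```

A registry keyed this way would not find the second stub, so matching by object identity alone would not explain the behaviour the reviewer observed.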



@github-actions github-actions bot added the conda, Java, and libcudf labels Feb 2, 2023
@brandon-b-miller brandon-b-miller removed request for a team and divyegala February 2, 2023 14:51
@brandon-b-miller
Contributor Author

Sorry all for the noise from the branch change of base here!

@brandon-b-miller brandon-b-miller removed the libcudf, CMake, conda, and Java labels Feb 2, 2023
@brandon-b-miller
Contributor Author

This has been updated with the changes from #12669 and should be ready to go again. The code is largely the same as when we last looked at it, the difference mainly being the shuffling around of files.

python/cudf/cudf/core/udf/strings_lowering.py (outdated, resolved)
Comment on lines 18 to 32
bool_binary_funcs = ["startswith", "endswith"]
int_binary_funcs = ["find", "rfind"]
id_unary_funcs = [
    "isalpha",
    "isalnum",
    "isdecimal",
    "isdigit",
    "isupper",
    "islower",
    "isspace",
    "isnumeric",
    "istitle",
]
string_unary_funcs = ["upper", "lower"]
string_return_attrs = ["strip", "lstrip", "rstrip"]
Contributor

I'm confused, these don't seem to be used at all anymore. A few questions:

  1. Can these be deleted now?
  2. Was there a reason to switch from the old monkey-patching approach to the new approach other than clarity, i.e. does it impact whether things work for both UDFString and StringView instead of just one of them?

Contributor Author

I ended up reverting this because it was mostly a refactor I forgot to undo. I'll clean this up in a separate PR.



-def get_kernel(func, dtype, size):
+def get_kernels(func, dtype, size):
     """
     Create a kernel for testing a single scalar string function
Contributor

Docstring needs updating to reflect that you create two separate kernels now, one for each type.

Contributor Author

Fixed
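A rough picture of what "two separate kernels, one for each type" could look like, written in plain Python so it runs without a GPU. Loops replace CUDA threads, `str` subclasses replace the device string types, the `dtype` parameter is dropped for brevity, and the inner kernel names are hypothetical; only `get_kernels` and `sv_to_udf_str` are borrowed from the diff context, and the real test helper may differ.

```python
# Plain-Python stand-in, not the actual test helper. Loops replace CUDA
# threads and str subclasses replace the device string types.


class StringView(str):
    """Stand-in for the read-only string type."""


class UDFString(str):
    """Stand-in for the owning string type."""


def sv_to_udf_str(sv):
    # Stand-in for the conversion used inside the test kernel.
    return UDFString(sv)


def get_kernels(func, size):
    """Build two test kernels for one scalar string function.

    The first applies func to each input as a StringView; the second
    converts each input to a UDFString before applying func.
    """

    def string_view_kernel(input_strings, output):
        for i in range(size):
            output[i] = func(input_strings[i])

    def udf_string_kernel(input_strings, output):
        for i in range(size):
            output[i] = func(sv_to_udf_str(input_strings[i]))

    return string_view_kernel, udf_string_kernel


inputs = [StringView("abc"), StringView("xyz")]
sv_kernel, udf_kernel = get_kernels(lambda s: s.startswith("a"), len(inputs))

sv_out = [None] * len(inputs)
udf_out = [None] * len(inputs)
sv_kernel(inputs, sv_out)
udf_kernel(inputs, udf_out)
print(sv_out, udf_out)
```

Running the same function through both kernels is what lets a test confirm that each method behaves identically on either string type.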



@brandon-b-miller
Contributor Author

Sorry I forgot to get to this last week, @vyasr. I think I've addressed your comments. I added some docs to sv_to_udf_str in 0840cdc; curious what you think, as this piece can be a little hard to explain conceptually.

Contributor

@vyasr vyasr left a comment

Thanks, the new docstring is quite helpful!

Contributor

@wence- wence- left a comment

/merge

@wence-
Contributor

wence- commented Mar 9, 2023

/merge

@rapids-bot rapids-bot bot merged commit 52c675a into rapidsai:branch-23.04 Mar 9, 2023
Labels
bug, non-breaking, numba, Python
5 participants