[FEA] Initial support for string UDFs via Numba #9639

Open
brandon-b-miller opened this issue Nov 9, 2021 · 12 comments
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. numba Numba issue Python Affects Python cuDF API. strings strings issues (C++ and Python)

Comments

@brandon-b-miller
Contributor

Is your feature request related to a problem? Please describe.
Currently we can't use string columns inside UDFs, for a number of reasons. First, Numba, which forms the basis of our UDF framework, has limited support for strings in general. Second, even if Numba supported strings, we would still need to extend it to generate kernels that work as expected on the buffers containing our string data. Lastly, there are special memory considerations on the GPU that complicate the situation further.

Describe the solution you'd like
Recently @davidwendt has experimented with a C++ class that solves many of the nuances around handling single strings that live on the device inside UDFs. @gmarkall subsequently wrote a proof of concept showing how simple string functions such as len can be overloaded using Numba to map to the methods of that C++ class and be baked into a kernel. We would like to plumb this machinery through cuDF. This roughly consists of the following steps:

  1. When cuDF is built, precompile the C++ string class and its methods and make them available as a blob of PTX or similar that we can link to when building a kernel in Python.
  2. Create the pipeline in Python that writes, links, compiles and executes the correct kernels, leveraging the aforementioned PTX blobs at runtime.
  3. Create Numba typing and lowering that overloads calls to common string functions in Python and maps them to the corresponding methods of the C++ class. Ideally we'd do all of them, although some may be more complex than others due to memory considerations. That's 43 functions:
  • capitalize
  • casefold
  • center
  • count
  • encode
  • endswith
  • expandtabs
  • find
  • format
  • format_map
  • index
  • isalnum
  • isalpha
  • isascii
  • isdecimal
  • isdigit
  • islower
  • isprintable
  • isspace
  • istitle
  • isupper
  • join
  • ljust
  • lower
  • lstrip
  • maketrans
  • removeprefix
  • removesuffix
  • replace
  • rfind
  • rindex
  • rjust
  • rpartition
  • rsplit
  • rstrip
  • split
  • splitlines
  • startswith
  • swapcase
  • title
  • translate
  • upper
  • zfill
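
Step 3 can be pictured with a toy, CPU-only model. This is only a sketch of the idea and does not use Numba's extension API: the `TYPING` and `LOWERING` tables, the `resolve` helper, and the `device_*` functions are hypothetical stand-ins for the typing declarations, the lowering implementations, and the precompiled C++ methods.

```python
# Toy model of "typing" and "lowering"; no Numba or GPU involved.
# Every name below is hypothetical and exists only for illustration.


# Stand-ins for precompiled device functions from the C++ string class.
def device_len(s: str) -> int:
    return len(s)


def device_startswith(s: str, prefix: str) -> bool:
    return s.startswith(prefix)


# "Typing": (Python-level call, argument types) -> result type.
TYPING = {
    ("len", ("string",)): "int64",
    ("startswith", ("string", "string")): "bool",
}

# "Lowering": the same key -> the external function that does the work.
LOWERING = {
    ("len", ("string",)): device_len,
    ("startswith", ("string", "string")): device_startswith,
}


def resolve(name, arg_types):
    """Return (result type, implementation), or raise if unsupported."""
    key = (name, tuple(arg_types))
    if key not in TYPING:
        raise TypeError(f"{name}{tuple(arg_types)} is not supported in a UDF")
    return TYPING[key], LOWERING[key]


result_type, impl = resolve("len", ["string"])
print(result_type, impl("hello"))
```

A call with no entry in the tables, such as `resolve("split", ["string"])`, fails at resolution time rather than at run time, which is roughly how an unsupported method would surface during compilation.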

Concretely, when we encounter a UDF that is written like this for example:

def f(row):
    return len(row['str_field'])

Our code should

  • Detect a declaration of len (typing), which we will write, that expects a MaskedType(string) and returns a MaskedType(int64)
  • Detect an implementation of the above (lowering) that, given a pointer to the start of the string, calls a compiled version of the C++ class's len method
  • Write a kernel that distributes the individual column strings among parallel threads and runs the function, capturing its output elementwise
  • Run it
  • If necessary, assemble the results into a column
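
The steps above can be emulated on the CPU to make the data flow concrete. This is a minimal sketch, assuming an Arrow-style layout (one bytes buffer plus an offsets array) and a separate validity mask; the helpers `build_column`, `string_len` and `launch` are hypothetical, and the loop stands in for GPU threads.

```python
# CPU-only emulation of the steps above; a real kernel would run on the GPU.
# Assumption: strings are stored as one UTF-8 bytes buffer plus offsets.
# build_column, string_len and launch are hypothetical helper names.


def build_column(values):
    """Pack Python strings (or None) into chars, offsets and a validity mask."""
    chars = bytearray()
    offsets = [0]
    valid = []
    for v in values:
        if v is not None:
            chars += v.encode("utf-8")
        offsets.append(len(chars))
        valid.append(v is not None)
    return bytes(chars), offsets, valid


def string_len(chars, start, size):
    """Stand-in for the C++ method: character count of one UTF-8 string."""
    return len(chars[start:start + size].decode("utf-8"))


def launch(chars, offsets, valid):
    """Each loop iteration plays the role of one thread handling one row."""
    n = len(valid)
    out = [0] * n
    out_valid = [False] * n
    for tid in range(n):
        if valid[tid]:  # masked semantics: a null input gives a null output
            start = offsets[tid]
            size = offsets[tid + 1] - start
            out[tid] = string_len(chars, start, size)
            out_valid[tid] = True
    return out, out_valid


chars, offsets, valid = build_column(["abc", None, "héllo", ""])
lengths, mask = launch(chars, offsets, valid)

# Assemble the results into a "column", using None for nulls.
result = [x if ok else None for x, ok in zip(lengths, mask)]
print(result)  # [3, None, 5, 0]
```

Because the output is one fixed-width integer per row, the result buffer can be allocated before the loop starts; no per-row allocation is needed.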

Describe alternatives you've considered

Additional context
If we can get this to work, it lays the groundwork for using other, more complex types inside UDFs in the future, following the same pattern of using Numba to map Python code to external function calls that we write to operate on a single data element.

Similar issue for applymap #3802

@brandon-b-miller brandon-b-miller added feature request New feature or request numba Numba issue libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. strings strings issues (C++ and Python) labels Nov 9, 2021
@brandon-b-miller brandon-b-miller self-assigned this Nov 9, 2021
@beckernick beckernick added this to the UDF Enhancements milestone Nov 9, 2021
@github-actions

github-actions bot commented Dec 9, 2021

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@gmarkall
Contributor

gmarkall commented Dec 9, 2021

I believe @brandon-b-miller is actively working on this.

Whilst I'm commenting, I'll add a note that numba/numba#7621 helps support this implementation so may be a useful reference along with numba/numba-examples#40, which requires a similar mechanism of linking CUDA C/C++ with Numba kernels.

@brandon-b-miller
Contributor Author

This is being worked on, albeit slowly for now. We've had a lot of offline discussions about how we intend to proceed, and the general consensus is that some of these functions will be a lot easier to support than others, namely the ones that have predictable memory requirements. Hopefully more to come here soon.
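
To illustrate what "predictable memory requirements" means here, the following plain-Python sketch compares per-row output sizes. It is only an illustration: `nbytes` and `output_sizes` are hypothetical helpers, and the 8-byte figure assumes the length is stored as an int64.

```python
# Compare per-row output sizes for a fixed-width result (len) and for
# string-valued results (strip, replace). Helper names are hypothetical.

INT64_BYTES = 8  # assumption: a length result is stored as one int64 per row


def nbytes(s: str) -> int:
    """Size of a string in UTF-8 bytes."""
    return len(s.encode("utf-8"))


def output_sizes(s: str) -> dict:
    """Bytes needed to hold the result of each operation for one row."""
    return {
        "input": nbytes(s),
        "len": INT64_BYTES,
        "strip": nbytes(s.strip()),
        "replace": nbytes(s.replace("-", "--")),
    }


for row in ["  padded  ", "a-b-c", "", "hello"]:
    print(repr(row), output_sizes(row))
```

The `len` column is the same for every row, so its output buffer can be sized from the row count alone. The `strip` and `replace` sizes depend on each row's contents, so the total output size is unknown until the function has run.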

@github-actions

github-actions bot commented Jan 9, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

github-actions bot commented Apr 9, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@gmarkall
Contributor

This is still being worked on.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@gmarkall
Contributor

This is still active.

@gmarkall
Contributor

@brandon-b-miller Should this be moved to a different project board?

@vyasr
Contributor

vyasr commented Oct 20, 2022

@brandon-b-miller do you want to keep using this issue to track the remaining work as well (the methods that output strings)?

@brandon-b-miller
Contributor Author

brandon-b-miller commented Oct 21, 2022

Just wanted to provide an update on this feature since we now have partial support for this and I think we have a clear picture of what's left to be done and a tentative timeline. Here is a summary.

22.10 introduced string UDFs via the strings_udf library
With the merge of #11319 (as well as a flurry of follow-up fixes), a new, separately installable package, strings_udf, was rolled out to support this. When it is present in the user's environment, users can pass string columns to UDFs through DataFrame.apply and Series.apply and use the following, hopefully familiar, Python methods within those UDFs:

  • str.count()
  • str.startswith()
  • str.endswith()
  • str.find()
  • str.rfind()
  • str.isalnum()
  • str.isalpha()
  • str.isdecimal()
  • str.isdigit()
  • str.islower()
  • str.isupper()
  • str.isnumeric()
  • str.isspace()
  • str.istitle()
  • Comparison operators between strings (==, !=, >, <, <=, >=)
  • Contains operation (str in other)
  • len(str)
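
As a rough illustration, a row UDF restricted to the operations listed above might look like the function below. The sketch runs on plain Python dictionaries so it can be checked without a GPU; the column names `name` and `code` are hypothetical.

```python
# Plain-Python check of a row UDF that uses only operations from the list
# above. The column names ("name", "code") are hypothetical.


def f(row):
    name = row["name"]
    code = row["code"]
    if name.startswith("a") and code.isdigit():
        return len(name) + name.count("a")
    if "x" in name or name > code:
        return name.find("x")
    return -1


rows = [
    {"name": "banana", "code": "123"},
    {"name": "apple", "code": "42"},
    {"name": "box", "code": "zz"},
    {"name": "abc", "code": "abd"},
]
print([f(r) for r in rows])  # [-1, 6, 2, -1]
```

With cuDF and strings_udf installed, the same function would presumably be handed to DataFrame.apply for a frame with those two string columns (something like `df.apply(f, axis=1)`); the exact call is an assumption here and should be checked against the cuDF documentation.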

CEC
CUDA 11.5 is currently required for this feature. Support for CUDA enhanced compatibility (CEC) is pending in PR #11884.

More features (methods that produce non numeric data)
Functions and methods that return strings are being worked on for 22.12 with the main PR implementing the bulk of the plumbing at #11933. After this is merged, the following features will be added in phases:

  • str.capitalize()
  • str.upper()
  • str.lower()
  • str.swapcase()
  • str.ljust()
  • str.rjust()
  • str.strip()
  • str.lstrip()
  • str.rstrip()
  • str.removeprefix()
  • str.removesuffix()
  • str.title()
  • str.center()
  • str.expandtabs()
  • str.replace()
  • str.zfill()
  • str.index()
  • str.rindex()
  • Substring indexing (str[1:3])
  • Concatenation (the + operator between strings)
  • Iteration (for char in str:)

The above functions currently require CUDA dynamic global memory allocation and can therefore have unpredictable performance characteristics. We hope to make this problem go away in the future.

Won't add for now
Some features, such as formatting, are not yet on the roadmap. The same goes for functions with structured return types, such as split, which returns a list.

Hopefully this gives us something to work with for now, with more updates to this thread to come!

rapids-bot bot pushed a commit that referenced this issue Nov 10, 2022
This PR adds support for the following three functions in `strings_udf`:

- `str.strip(other)`
- `str.lstrip(other)`
- `str.rstrip(other)`

Part of #9639

Authors:
  - https://github.com/brandon-b-miller
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #12091
rapids-bot bot pushed a commit that referenced this issue Nov 16, 2022
This PR adds support for the following operator in `strings_udf`:

- `st + other`

Part of #9639

Authors:
  - https://github.com/brandon-b-miller
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

URL: #12117
rapids-bot bot pushed a commit that referenced this issue Nov 17, 2022
This PR adds support for the following two functions in `strings_udf`:

- `str.upper()`
- `str.lower()`

Part of #9639

Authors:
  - https://github.com/brandon-b-miller
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Lawrence Mitchell (https://github.com/wence-)
  - David Wendt (https://github.com/davidwendt)

URL: #12099
rapids-bot bot pushed a commit that referenced this issue Nov 30, 2022
This PR adds support for the following function in `strings_udf`:

- `str.replace`

Part of #9639

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #12207
@vyasr
Contributor

vyasr commented May 17, 2024

@brandon-b-miller could you update this issue with the current state of UDFs?

Projects
Status: Todo
Development

No branches or pull requests

4 participants