[FEA] Initial support for string UDFs via Numba #9639
Comments
This issue has been labeled |
I believe @brandon-b-miller is actively working on this. Whilst I'm commenting, I'll add a note that numba/numba#7621 helps support this implementation, so it may be a useful reference along with numba/numba-examples#40, which requires a similar mechanism of linking CUDA C/C++ with Numba kernels.
This is being worked on, albeit slowly for now. We've had a lot of offline discussions about how we intend to proceed, and the general consensus is that some of these functions will be a lot easier to support than others, namely the ones that have predictable memory requirements. Hopefully more to come here soon.
This is still being worked on.
This is still active.
@brandon-b-miller Should this be moved to a different project board?
@brandon-b-miller do you want to keep using this issue to track the remaining work as well (the methods that output strings)?
Just wanted to provide an update on this feature, since we now have partial support for it and I think we have a clear picture of what's left to be done and a tentative timeline. Here is a summary. 22.10 introduced string UDFs via the
CEC More features (methods that produce non numeric data)
The above functions currently require CUDA dynamic global memory allocation and can therefore have some unpredictable performance characteristics. We hope to make this problem go away in the future.
Won't add for now
Hopefully this gives us something to work with for now, and hopefully there will be more updates to this thread in the future!
This PR adds support for the following three functions in `strings_udf`:
- `str.strip(other)`
- `str.lstrip(other)`
- `str.rstrip(other)`
Part of #9639
Authors:
- https://github.com/brandon-b-miller
- David Wendt (https://github.com/davidwendt)
Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
URL: #12091

This PR adds support for the following operator in `strings_udf`:
- `st + other`
Part of #9639
Authors:
- https://github.com/brandon-b-miller
- David Wendt (https://github.com/davidwendt)
Approvers:
- David Wendt (https://github.com/davidwendt)
- Bradley Dice (https://github.com/bdice)
URL: #12117

This PR adds support for the following two functions in `strings_udf`:
- `str.upper()`
- `str.lower()`
Part of #9639
Authors:
- https://github.com/brandon-b-miller
- David Wendt (https://github.com/davidwendt)
Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Lawrence Mitchell (https://github.com/wence-)
- David Wendt (https://github.com/davidwendt)
URL: #12099

This PR adds support for the following function in `strings_udf`:
- `str.replace`
Part of #9639
Authors:
- https://github.com/brandon-b-miller
Approvers:
- Ashwin Srinath (https://github.com/shwina)
URL: #12207
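For readers following along, here is a rough sketch of the kind of row-wise function these PRs are aimed at. It only uses the operations named above (`strip(other)`, `upper()`, `replace`, and `+`). The function name `clean_name` and the sample data are made up for illustration, and the sketch runs on plain Python strings as a host-side reference. The assumption is that, in a cuDF build with string UDF support, the same function could be passed to `Series.apply` on a string column; whether it compiles as-is depends on the cuDF version.

```python
def clean_name(st):
    # Hypothetical UDF body: every call below is one of the string
    # operations listed in the PR descriptions above.
    trimmed = st.strip("_")          # str.strip(other)
    shouted = trimmed.upper()        # str.upper()
    # str.replace followed by string concatenation with `+`
    return "id-" + shouted.replace(" ", "-")


# Host-side reference run on ordinary Python strings (no GPU needed).
names = ["__ada lovelace__", "alan turing_"]
results = [clean_name(n) for n in names]
print(results)  # ['id-ADA-LOVELACE', 'id-ALAN-TURING']
```

Note that each of these operations produces a new string whose length depends on the input data, which is why they fall into the harder "methods that output strings" group mentioned earlier in the thread.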
@brandon-b-miller could you update this issue with the current state of UDFs?
Is your feature request related to a problem? Please describe.
Currently we can't use string columns inside UDFs, for a number of reasons. First, Numba, which forms the basis of our UDF framework, has limited support for strings in general. Second, even if strings were supported in Numba, we would still need to extend it so that it can generate kernels that work as we expect on the buffers containing our string data. Lastly, there are special memory considerations on the GPU that complicate the situation further.
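To make "the buffers containing our string data" a bit more concrete, here is a minimal host-side sketch of an Arrow-style string column: one contiguous buffer of UTF-8 bytes plus an offsets array with one more entry than there are rows. The helper names `build_string_column` and `row_bytes` are hypothetical, and the validity (null) mask is left out for brevity. It is only meant to show why a per-row UDF cannot simply be handed a Python `str`.

```python
def build_string_column(strings):
    """Pack Python strings into (chars, offsets), Arrow-style."""
    chars = bytearray()
    offsets = [0]
    for s in strings:
        chars.extend(s.encode("utf-8"))
        # offsets[i + 1] marks the end of row i in the chars buffer
        offsets.append(len(chars))
    return bytes(chars), offsets


def row_bytes(chars, offsets, i):
    """Return the UTF-8 bytes of row i as a slice of the shared buffer."""
    return chars[offsets[i]:offsets[i + 1]]


chars, offsets = build_string_column(["cat", "", "horse"])
print(chars)    # b'cathorse'
print(offsets)  # [0, 3, 3, 8]
print(row_bytes(chars, offsets, 2))  # b'horse'
```

A generated kernel would have to turn each (pointer, size) pair derived from these buffers into something that string methods can operate on, which is the gap the rest of this issue is about.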
Describe the solution you'd like
Recently @davidwendt has experimented with a C++ class which solves many of the nuances around handling single strings that live on the device inside UDFs. @gmarkall subsequently wrote a proof of concept showing how simple string functions such as `len` can be overloaded using Numba to map to the methods contained in that C++ class and baked into a kernel. We would like to plumb this machinery through cuDF. This roughly consists of the following steps:
capitalize
casefold
center
count
encode
endswith
expandtabs
find
format
format_map
index
isalnum
isalpha
isascii
isdecimal
isdigit
islower
isprintable
isspace
istitle
isupper
join
ljust
lower
lstrip
maketrans
removeprefix
removesuffix
replace
rfind
rindex
rjust
rpartition
rsplit
rstrip
split
splitlines
startswith
swapcase
title
translate
upper
zfill
Concretely, when we encounter a UDF that is written like this for example:
Our code should
`len` that we will write which expects a `MaskedType(string)` and returns a `MaskedType(int64)`
`len` method when provided a pointer to the start of the string
Describe alternatives you've considered
Additional context
If we can get this to work it lays the groundwork for being able to use other more complex types inside UDFs in the future, following the same pattern of using numba to map python code to external function calls that we write to operate on a single data element.
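The per-element pattern described in this issue, a function that takes a masked string and returns a masked integer, can be sketched in plain Python as follows. This is only an illustration of the null-propagation semantics: the `Masked` class and `masked_len` function are hypothetical stand-ins, not cuDF's or Numba's actual types, and the real implementation would dispatch to a device-side function rather than Python's built-in `len`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Masked:
    """Hypothetical stand-in for a masked scalar: a value plus a validity flag."""
    value: Optional[object]
    valid: bool


def masked_len(st: Masked) -> Masked:
    """Masked string in, masked integer out; nulls propagate."""
    if not st.valid:
        # A null input row produces a null output row.
        return Masked(None, False)
    # In the real design this call would map to an external function
    # operating on a single string element.
    return Masked(len(st.value), True)


rows = [Masked("hello", True), Masked(None, False), Masked("", True)]
lengths = [masked_len(r) for r in rows]
print(lengths)
```

The same shape would apply to any other method with a fixed-size result, for example `startswith` returning a masked boolean.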
Similar issue for `applymap`: #3802