Add read-only functions on string dtypes to DataFrame.apply and Series.apply #11319
Merged
rapids-bot merged 245 commits into rapidsai:branch-22.10 from brandon-b-miller:string_udfs on Sep 20, 2022
Co-authored-by: Vyas Ramasubramani <[email protected]>
raydouglass reviewed Sep 14, 2022
rerun tests
Can you remerge with 22.10? There are a bunch of extra files in the PR.
brandon-b-miller changed the base branch from feature/string_udfs to branch-22.10 September 20, 2022 14:17
shwina approved these changes Sep 20, 2022
This should be resolved with the change of base.
davidwendt approved these changes Sep 20, 2022
@gpucibot merge
raydouglass pushed a commit that referenced this pull request Sep 29, 2022
## Description
This switches to using CubinLinker (from PTXCompiler, but CubinLinker uses PTXCompiler internally) for Minor Version Compatibility. This enables support for all Numba features except linking archives with MVC, in support of use cases such as String UDFs (#11319) with MVC.

## Checklist
- [X] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [X] New or existing tests cover these changes.
- [X] The documentation is up to date with these changes.

Authors:
- Graham Markall (https://github.com/gmarkall)
- https://github.com/brandon-b-miller
- Ashwin Srinath (https://github.com/shwina)

Approvers:
- Ray Douglass (https://github.com/raydouglass)
Labels
- 3 - Ready for Review: Ready for review by team
- CMake: CMake build issue
- feature request: New feature or request
- non-breaking: Non-breaking change
- Python: Affects Python cuDF API.
This PR provides initial support for string data inside UDFs passed to `DataFrame.apply` and `Series.apply`. The allowed APIs are based on Python's `str` class. The PR aims to implement Python string semantics as closely as possible, starting with APIs that return numeric data only. These are the following 21 functions:
- `str.count`
- `str.startswith`
- `str.endswith`
- `str.find`
- `str.rfind`
- `str.isalnum`
- `str.isdecimal`
- `str.isdigit`
- `str.islower`
- `str.isupper`
- `str.isalpha`
- `str.istitle`
- `str.isspace`
- `==`, `!=`, `>=`, `<=`, `>`, `<` (between two strings)
- `len`
- `__contains__`
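As an illustration of the kind of UDF this list allows, here is a minimal sketch. The function name `describe` and the sample data are hypothetical. The sketch runs in plain Python so the string semantics can be checked without a GPU; the commented-out lines show how the same function would presumably be passed to `Series.apply` in cuDF, which is an assumption based on the description above rather than something verified against a specific release.

```python
def describe(s):
    """Hypothetical UDF: uses only read-only string APIs and returns an int."""
    if s.startswith("gpu"):
        return s.count("u")   # str.count
    elif "x" in s:            # __contains__
        return s.find("x")    # str.find
    elif s.isdigit():         # str.isdigit
        return len(s)         # len
    return -1


# Plain-Python stand-in for applying the UDF to a column of strings.
data = ["gpu_udf", "matrix", "2022", "Hello"]
results = [describe(s) for s in data]
print(results)

# With cuDF (requires an NVIDIA GPU; assumed usage, not run here):
# import cudf
# sr = cudf.Series(data)
# sr.apply(describe)
```

Every branch returns an integer, which matches the restriction to APIs that return numeric data only.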
The following 3 functions are not included because no libcudf equivalent is available to back them (they refer to Python concepts):
- `str.isascii`
- `str.isidentifier`
- `str.isprintable`
This works by creating a library of `__device__` functions based on libcudf, each of which performs one of the above operations on a single string. The rest of the code is a library of Numba extensions that replace a Python UDF with a chain of those `__device__` functions and create a kernel that calls the result across a grid of threads, taking a full column of strings as input.

cc @davidwendt @gmarkall
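The execution model described above can be pictured with a small pure-Python emulation. This is only a conceptual sketch, not the PR's implementation: the names `device_startswith`, `device_len`, `udf_chain`, `kernel`, and `launch` are hypothetical, the "device functions" are ordinary Python functions, and the grid of threads is emulated with a sequential loop.

```python
# Stand-ins for per-string __device__ functions (hypothetical names).
def device_startswith(s, prefix):
    return s.startswith(prefix)


def device_len(s):
    return len(s)


def udf_chain(s):
    """What a UDF such as `lambda s: len(s) if s.startswith("a") else 0`
    might be lowered to: a chain of per-string device calls."""
    return device_len(s) if device_startswith(s, "a") else 0


def kernel(thread_id, column, out):
    """One emulated thread handles one row; extra threads do nothing."""
    if thread_id < len(column):
        out[thread_id] = udf_chain(column[thread_id])


def launch(column, threads_per_block=4):
    """Emulate launching enough blocks to cover the whole column."""
    out = [0] * len(column)
    blocks = (len(column) + threads_per_block - 1) // threads_per_block
    for tid in range(blocks * threads_per_block):  # sequential stand-in for the grid
        kernel(tid, column, out)
    return out


column = ["apple", "banana", "avocado", "cherry", "a"]
print(launch(column))
```

The bounds check in `kernel` mirrors the usual guard in GPU kernels, where the grid is rounded up to a whole number of blocks and may contain more threads than rows.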