Add read-only functions on string dtypes to DataFrame.apply and Series.apply #11319

Merged
merged 245 commits into from
Sep 20, 2022

Conversation

brandon-b-miller
Contributor

This PR provides initial support for string data inside UDFs passed to DataFrame.apply and Series.apply. The allowed APIs are based on Python's str class. The aim is to implement Python string semantics as closely as possible, starting with APIs that return numeric data only. The following 21 functions are supported:

  • str.count
  • str.startswith
  • str.endswith
  • str.find
  • str.rfind
  • str.isalnum
  • str.isdecimal
  • str.isdigit
  • str.islower
  • str.isupper
  • str.isalpha
  • str.istitle
  • str.isspace
  • ==, !=, >=, <=, >, < (between two strings)
  • len
  • __contains__
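
As a rough illustration of the kind of UDF this enables, the sketch below uses only read-only APIs from the list above and returns an integer for every input. The function name `classify` and the sample data are hypothetical. It runs on plain Python strings here, which is also the behavior the GPU implementation is meant to match; the cuDF call is shown only as a comment because it needs a GPU.

```python
def classify(s):
    # Only read-only str APIs from the supported list are used,
    # and every branch returns an int.
    if s.startswith("GPU") and s.isupper():
        return len(s)
    if "cu" in s:  # __contains__
        return s.find("cu")
    return s.count("a")


words = ["GPUDIRECT", "libcudf", "banana", ""]

# Reference semantics on the CPU with plain Python strings.
results = [classify(w) for w in words]
print(results)

# With cuDF and a GPU available, the same function would be passed to apply
# (assumption: not executed here):
#   import cudf
#   cudf.Series(words).apply(classify)
```

Comparing the GPU result against a plain Python list comprehension like this one is a simple way to check that the string semantics agree.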

The following 3 functions are not included because they refer to Python-specific concepts and have no libcudf equivalent to back them:

  • str.isascii
  • str.isidentifier
  • str.isprintable

This works by creating a library of __device__ functions, based on libcudf, that each perform one of the above operations on a single string. The rest of the code is a library of Numba extensions that replace a Python UDF with a chain of those __device__ functions and create a kernel that calls the result across a grid of threads, taking a full column of strings as input.
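
The structure described above can be modeled on the CPU as a conceptual sketch. This is not the PR's code: every name here (`device_len`, `device_count`, `udf_as_chain`, `kernel`, `launch`) is hypothetical, and ordinary Python loops stand in for the CUDA grid. It only shows the shape of the design: per-string functions, a UDF expressed as a chain of them, and a kernel in which each thread handles one row of the column.

```python
# Stand-ins for the per-string __device__ functions: each works on ONE string.
def device_len(s):
    return len(s)


def device_count(s, sub):
    return s.count(sub)


def udf_as_chain(s):
    # A UDF such as `lambda s: len(s) + s.count("a")`, rewritten as a chain
    # of calls to the per-string functions above.
    return device_len(s) + device_count(s, "a")


def kernel(thread_id, column, out):
    # One thread per row; the bounds check guards threads past the column end.
    if thread_id < len(column):
        out[thread_id] = udf_as_chain(column[thread_id])


def launch(column, threads_per_block=4):
    # Simulate a 1D grid: enough blocks to cover every row of the column.
    out = [0] * len(column)
    blocks = (len(column) + threads_per_block - 1) // threads_per_block
    for block in range(blocks):
        for thread in range(threads_per_block):
            kernel(block * threads_per_block + thread, column, out)
    return out


column = ["banana", "kiwi", "avocado", "fig", "papaya"]
out = launch(column)
print(out)
```

With five rows and four threads per block, the simulated launch uses two blocks, and the last three threads of the second block do nothing because of the bounds check.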

cc @davidwendt @gmarkall

@brandon-b-miller
Contributor Author

rerun tests

@brandon-b-miller brandon-b-miller requested a review from a team as a code owner September 19, 2022 16:11
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 19, 2022
@brandon-b-miller brandon-b-miller removed the request for review from a team September 19, 2022 16:12
@davidwendt
Contributor

Can you remerge with 22.10? There are a bunch of extra files in the PR.

@brandon-b-miller brandon-b-miller changed the base branch from feature/string_udfs to branch-22.10 September 20, 2022 14:17
@github-actions github-actions bot removed the libcudf Affects libcudf (C++/CUDA) code. label Sep 20, 2022
@brandon-b-miller
Contributor Author

Can you remerge with 22.10? There are a bunch of extra files in the PR.

This should be resolved with the change of base.

@brandon-b-miller
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 0528b38 into rapidsai:branch-22.10 Sep 20, 2022
raydouglass pushed a commit that referenced this pull request Sep 29, 2022
## Description

This switches from PTXCompiler to CubinLinker (which uses PTXCompiler internally) for Minor Version Compatibility (MVC). This enables support for all Numba features under MVC except linking archives, in support of use cases such as String UDFs (#11319).

## Checklist
- [X] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [X] New or existing tests cover these changes.
- [X] The documentation is up to date with these changes.

Authors:
   - Graham Markall (https://github.com/gmarkall)
   - https://github.com/brandon-b-miller
   - Ashwin Srinath (https://github.com/shwina)

Approvers:
   - Ray Douglass (https://github.com/raydouglass)
Labels: 3 - Ready for Review, CMake, feature request, non-breaking, Python
9 participants