Add read-only functions on string dtypes to `DataFrame.apply` and `Series.apply` (#11319)

This PR provides initial support for string data inside UDFs passed to `DataFrame.apply` and `Series.apply`. The allowed APIs are based on python's `str` class. It aims to implement python string semantics as closely as possible starting with APIs that ***return numeric data only.*** These are the following 21 functions:

- `str.count`
- `str.startswith`
- `str.endswith`
- `str.find`
- `str.rfind`
- `str.isalnum`
- `str.isdecimal`
- `str.isdigit`
- `str.islower`
- `str.isupper`
- `str.isalpha`
- `str.istitle`
- `str.isspace`
- `==`, `!=`, `>=`, `<=`, `>`, `<` (between two strings)
- `len`
- `__contains__`

The following 3 functions are not included due to having no libcudf equivalent code available to back them (due to them referring to python concepts)

- `str.isascii`
- `str.isidentifier`
- `str.isprintable`

This works by creating a library of `__device__` functions based on libcudf which perform the above functions for one single string. The rest of the code is a library of numba extensions that replace a python UDF with a chain of those `__device__` functions and creates a kernel that calls the result across a grid of threads, taking a full column of strings as input.

cc @davidwendt @gmarkall

Authors:
- https://github.com/brandon-b-miller
- Bradley Dice (https://github.com/bdice)
- David Wendt (https://github.com/davidwendt)
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)
- Ashwin Srinath (https://github.com/shwina)
- David Wendt (https://github.com/davidwendt)

URL: #11319
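
To illustrate the kind of UDF the description above targets, here is a minimal host-side sketch in plain Python. The function name `score` and the sample data are hypothetical; the body uses only APIs from the supported list (`str.startswith`, `__contains__`, `len`, `str.isdigit`, `str.count`, `str.find`) and returns numeric data only. With cuDF and the `strings_udf` package available, a function of this shape would presumably be passed to `Series.apply`; the list comprehension below is only a CPU stand-in for that call.

```python
def score(s: str) -> int:
    """Combine several read-only str APIs that return numeric or boolean data."""
    if s.startswith("cu") and "df" in s:  # str.startswith and __contains__
        return len(s)                     # len
    if s.isdigit():                       # str.isdigit
        return s.count("1")               # str.count
    return s.find("a")                    # str.find returns -1 when absent


# Hypothetical sample data; on the GPU this would be a column of strings.
words = ["cudf", "1011", "pandas", "xyz"]

# Host-side stand-in for applying the UDF to every row of a string column.
results = [score(w) for w in words]
print(results)  # [4, 3, 1, -1]
```

Because every branch returns an integer, the result column has a single numeric type, which matches the PR's restriction to APIs that return numeric data.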
1 parent d10406f, commit 0528b38
Showing 45 changed files with 5,629 additions and 92 deletions.
@@ -1,4 +1,5 @@
 python/cudf/cudf/_version.py export-subst
+python/strings_udf/strings_udf/_version.py export-subst
 python/cudf_kafka/cudf_kafka/_version.py export-subst
 python/custreamz/custreamz/_version.py export-subst
 python/dask_cudf/dask_cudf/_version.py export-subst
@@ -0,0 +1,4 @@
+# Copyright (c) 2022, NVIDIA CORPORATION.
+
+# This assumes the script is executed from the root of the repo directory
+./build.sh strings_udf
@@ -0,0 +1,14 @@
+c_compiler_version:
+  - 9
+
+cxx_compiler_version:
+  - 9
+
+sysroot_version:
+  - "2.17"
+
+cmake_version:
+  - ">=3.20.1,!=3.23.0"
+
+cuda_compiler:
+  - nvcc
@@ -0,0 +1,65 @@
+# Copyright (c) 2022, NVIDIA CORPORATION.
+
+{% set version = environ.get('GIT_DESCRIBE_TAG', '0.0.0.dev').lstrip('v') + environ.get('VERSION_SUFFIX', '') %}
+{% set minor_version = version.split('.')[0] + '.' + version.split('.')[1] %}
+{% set py_version=environ.get('CONDA_PY', 38) %}
+{% set cuda_version='.'.join(environ.get('CUDA', '11.5').split('.')[:2]) %}
+{% set cuda_major=cuda_version.split('.')[0] %}
+
+package:
+  name: strings_udf
+  version: {{ version }}
+
+source:
+  git_url: ../../..
+
+build:
+  number: {{ GIT_DESCRIBE_NUMBER }}
+  string: cuda_{{ cuda_major }}_py{{ py_version }}_{{ GIT_DESCRIBE_HASH }}_{{ GIT_DESCRIBE_NUMBER }}
+  script_env:
+    - VERSION_SUFFIX
+    - PARALLEL_LEVEL
+  # libcudf's run_exports pinning is looser than we would like
+  ignore_run_exports:
+    - libcudf
+  ignore_run_exports_from:
+    - {{ compiler('cuda') }}
+
+requirements:
+  build:
+    - cmake {{ cmake_version }}
+    - {{ compiler('c') }}
+    - {{ compiler('cxx') }}
+    - {{ compiler('cuda') }} {{ cuda_version }}
+    - sysroot_{{ target_platform }} {{ sysroot_version }}
+  host:
+    - python
+    - cython >=0.29,<0.30
+    - scikit-build>=0.13.1
+    - setuptools
+    - numba >=0.54
+    - libcudf ={{ version }}
+    - cudf ={{ version }}
+    - cudatoolkit ={{ cuda_version }}
+  run:
+    - python
+    - typing_extensions
+    - numba >=0.54
+    - numpy
+    - libcudf ={{ version }}
+    - cudf ={{ version }}
+    - {{ pin_compatible('cudatoolkit', max_pin='x', min_pin='x') }}
+    - cachetools
+    - ptxcompiler  # [linux64]  # CUDA enhanced compatibility. See https://github.com/rapidsai/ptxcompiler
+test:                                   # [linux64]
+  requires:                             # [linux64]
+    - cudatoolkit {{ cuda_version }}.*  # [linux64]
+  imports:                              # [linux64]
+    - strings_udf                       # [linux64]
+
+about:
+  home: https://rapids.ai/
+  license: Apache-2.0
+  license_family: APACHE
+  license_file: LICENSE
+  summary: strings_udf library
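
The Jinja `{% set ... %}` expressions at the top of this recipe derive the package and CUDA versions from environment variables. As a rough sketch of what they evaluate to, the same string operations can be written in plain Python. The helper name `recipe_vars` and the sample tag `v22.10.00` are assumptions made for illustration; only the expressions themselves come from the recipe.

```python
def recipe_vars(environ: dict) -> dict:
    """Mirror the recipe's Jinja `set` expressions using ordinary str methods."""
    # Strip leading "v" characters from the git tag, then append any suffix.
    version = environ.get("GIT_DESCRIBE_TAG", "0.0.0.dev").lstrip("v") + environ.get(
        "VERSION_SUFFIX", ""
    )
    # Keep only the first two dot-separated components.
    minor_version = version.split(".")[0] + "." + version.split(".")[1]
    # Reduce e.g. "11.5.1" to "11.5", then take the major component.
    cuda_version = ".".join(environ.get("CUDA", "11.5").split(".")[:2])
    cuda_major = cuda_version.split(".")[0]
    return {
        "version": version,
        "minor_version": minor_version,
        "cuda_version": cuda_version,
        "cuda_major": cuda_major,
    }


# Hypothetical environment, as a CI job might export it.
example = recipe_vars({"GIT_DESCRIBE_TAG": "v22.10.00", "CUDA": "11.5.1"})
print(example)

# With no variables set, the recipe's defaults apply.
defaults = recipe_vars({})
print(defaults)
```

Note that `lstrip('v')` removes every leading `v` character rather than a single prefix, which is harmless for tags of the form `vX.Y.Z`.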
@@ -1,3 +1,65 @@
-# Copyright (c) 2021-2022, NVIDIA CORPORATION.
+# Copyright (c) 2022, NVIDIA CORPORATION.
+import numpy as np
+from numba import cuda, types
+from numba.cuda.cudaimpl import (
+    lower as cuda_lower,
+    registry as cuda_lowering_registry,
+)
 
-from . import lowering, typing
+from cudf.core.dtypes import dtype
+from cudf.core.udf import api, row_function, utils
+from cudf.utils.dtypes import STRING_TYPES
+
+from . import masked_lowering, masked_typing
+
+_units = ["ns", "ms", "us", "s"]
+_datetime_cases = {types.NPDatetime(u) for u in _units}
+_timedelta_cases = {types.NPTimedelta(u) for u in _units}
+
+
+_supported_masked_types = (
+    types.integer_domain
+    | types.real_domain
+    | _datetime_cases
+    | _timedelta_cases
+    | {types.boolean}
+)
+
+_STRING_UDFS_ENABLED = False
+try:
+    import strings_udf
+
+    if strings_udf.ENABLED:
+        from . import strings_typing  # isort: skip
+        from . import strings_lowering  # isort: skip
+        from strings_udf import ptxpath
+        from strings_udf._lib.cudf_jit_udf import to_string_view_array
+        from strings_udf._typing import str_view_arg_handler, string_view
+
+        # add an overload of MaskedType.__init__(string_view, bool)
+        cuda_lower(api.Masked, strings_typing.string_view, types.boolean)(
+            masked_lowering.masked_constructor
+        )
+
+        # add an overload of pack_return(string_view)
+        cuda_lower(api.pack_return, strings_typing.string_view)(
+            masked_lowering.pack_return_scalar_impl
+        )
+
+        _supported_masked_types |= {strings_typing.string_view}
+        utils.launch_arg_getters[dtype("O")] = to_string_view_array
+        utils.masked_array_types[dtype("O")] = string_view
+        utils.JIT_SUPPORTED_TYPES |= STRING_TYPES
+        utils.ptx_files.append(ptxpath)
+        utils.arg_handlers.append(str_view_arg_handler)
+        row_function.itemsizes[dtype("O")] = string_view.size_bytes
+
+        _STRING_UDFS_ENABLED = True
+    else:
+        del strings_udf
+
+except ImportError as e:
+    # allow cuDF to work without strings_udf
+    pass
+
+masked_typing.register_masked_constructor(_supported_masked_types)
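
The hunk above gates string UDF support on an optional import: the extra types and handlers are registered only when `strings_udf` imports successfully and reports `ENABLED`. A simplified, dependency-free sketch of that pattern follows. The names `SUPPORTED_TYPES`, `enable_optional_strings`, and `hypothetical_missing_backend` are illustrative assumptions, not part of cuDF, and the string type names stand in for the Numba types used in the real module.

```python
import importlib

# Types supported unconditionally (placeholders for the Numba types in the diff).
SUPPORTED_TYPES = {"int64", "float64", "bool"}
STRING_UDFS_ENABLED = False


def enable_optional_strings(module_name: str) -> bool:
    """Register string support only if the optional backend imports and is enabled."""
    global STRING_UDFS_ENABLED
    try:
        backend = importlib.import_module(module_name)
    except ImportError:
        # The core library keeps working without the optional backend.
        return False
    if getattr(backend, "ENABLED", False):
        SUPPORTED_TYPES.add("string_view")
        STRING_UDFS_ENABLED = True
    return STRING_UDFS_ENABLED


# A module name assumed not to exist, to show the fallback path.
print(enable_optional_strings("hypothetical_missing_backend"))  # False
```

Keeping the registration inside the guarded branch means callers can check a single flag, as `_STRING_UDFS_ENABLED` does in the diff, instead of repeating the import check.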