Move strings_udf code into cuDF #12669

Merged
Changes from 21 commits
37 commits
452f45f
move over lots of code and make cudf importable
brandon-b-miller Jan 31, 2023
bf2a7bc
python cleanup first pass
brandon-b-miller Jan 31, 2023
b2c6dbd
remove strings_udf package, old test file moved
brandon-b-miller Jan 31, 2023
4abb7c2
pass tests in test_string_udfs
brandon-b-miller Jan 31, 2023
cb78882
python cleanup second pass
brandon-b-miller Jan 31, 2023
b2921ea
masked type imports the string not the other way around :)
brandon-b-miller Feb 1, 2023
fb35e98
python cleanup third pass
brandon-b-miller Feb 1, 2023
3eeff2f
have fun deleting code
brandon-b-miller Feb 1, 2023
5b21a5e
Merge branch 'branch-23.04' into sunset-stringsudf-package
brandon-b-miller Feb 7, 2023
f37206c
python cleanup fourth pass
brandon-b-miller Feb 7, 2023
a32da85
remove unnecessary code
brandon-b-miller Feb 7, 2023
77c5d1d
merge latest
brandon-b-miller Feb 10, 2023
3f4a409
remove remaining references to strings_udf in code
brandon-b-miller Feb 10, 2023
93bad36
only have one PTX file
brandon-b-miller Feb 10, 2023
84fb00f
merge latest
brandon-b-miller Feb 10, 2023
7859e0a
merge latest
brandon-b-miller Feb 14, 2023
f9fa7a0
suitecode -> exitcode
brandon-b-miller Feb 14, 2023
1963937
only link cudf_strings_udf to strings_udf cython module
brandon-b-miller Feb 14, 2023
fbcf7c1
use initfunc
brandon-b-miller Feb 14, 2023
f424b0a
don't allow pyobject to map to a maskedtype
brandon-b-miller Feb 14, 2023
cd7bec0
simplify logic
brandon-b-miller Feb 14, 2023
c304f5c
address reviews, only load one ptx file
brandon-b-miller Feb 15, 2023
4cbe437
reorder
brandon-b-miller Feb 15, 2023
9b175e9
cmake updates
brandon-b-miller Feb 16, 2023
24c0de0
Merge branch 'branch-23.04' into sunset-stringsudf-package
brandon-b-miller Feb 16, 2023
7ff60a0
fix notebooks
brandon-b-miller Feb 16, 2023
dc5f3f9
Merge branch 'branch-23.04' into sunset-stringsudf-package
brandon-b-miller Feb 17, 2023
fc324d4
debug commit
brandon-b-miller Feb 17, 2023
9f522fc
second debug commit
brandon-b-miller Feb 17, 2023
44a67aa
more debugging
brandon-b-miller Feb 17, 2023
aa75ec5
revert changes
brandon-b-miller Feb 21, 2023
3a841de
revert more changes
brandon-b-miller Feb 21, 2023
473ac7a
look for libcudf.so one dir up
brandon-b-miller Feb 21, 2023
2945dc8
Merge remote-tracking branch 'origin/branch-23.04' into sunset-string…
vyasr Feb 21, 2023
f8ccee6
Remove redundant cuda arch init.
vyasr Feb 21, 2023
e8f58eb
nvrtc may not be needed
brandon-b-miller Feb 22, 2023
e84e243
Merge branch 'sunset-stringsudf-package' of github.com:brandon-b-mill…
brandon-b-miller Feb 22, 2023
1 change: 0 additions & 1 deletion .gitattributes
@@ -1,5 +1,4 @@
python/cudf/cudf/_version.py export-subst
python/strings_udf/strings_udf/_version.py export-subst
python/cudf_kafka/cudf_kafka/_version.py export-subst
python/custreamz/custreamz/_version.py export-subst
python/dask_cudf/dask_cudf/_version.py export-subst
2 changes: 0 additions & 2 deletions .gitignore
@@ -36,8 +36,6 @@ python/cudf_kafka/*/_lib/**/*.cpp
python/cudf_kafka/*/_lib/**/*.h
python/custreamz/*/_lib/**/*.cpp
python/custreamz/*/_lib/**/*.h
python/strings_udf/strings_udf/_lib/*.cpp
python/strings_udf/strings_udf/*.ptx
.Python
env/
develop-eggs/
Expand Down
10 changes: 1 addition & 9 deletions build.sh
@@ -17,7 +17,7 @@ ARGS=$*
# script, and that this script resides in the repo dir!
REPODIR=$(cd $(dirname $0); pwd)

VALIDARGS="clean libcudf cudf cudfjar dask_cudf benchmarks tests libcudf_kafka cudf_kafka custreamz strings_udf -v -g -n -l --allgpuarch --disable_nvtx --opensource_nvcomp --show_depr_warn --ptds -h --build_metrics --incl_cache_stats"
VALIDARGS="clean libcudf cudf cudfjar dask_cudf benchmarks tests libcudf_kafka cudf_kafka custreamz -v -g -n -l --allgpuarch --disable_nvtx --opensource_nvcomp --show_depr_warn --ptds -h --build_metrics --incl_cache_stats"
HELP="$0 [clean] [libcudf] [cudf] [cudfjar] [dask_cudf] [benchmarks] [tests] [libcudf_kafka] [cudf_kafka] [custreamz] [-v] [-g] [-n] [-h] [--cmake-args=\\\"<args>\\\"]
clean - remove all existing build artifacts and configuration (start
over)
@@ -335,14 +335,6 @@ if buildAll || hasArg cudf; then
fi
fi

if buildAll || hasArg strings_udf; then

cd ${REPODIR}/python/strings_udf
python setup.py build_ext --inplace -- -DCMAKE_PREFIX_PATH=${INSTALL_PREFIX} -DCMAKE_LIBRARY_PATH=${LIBCUDF_BUILD_DIR} -DCMAKE_CUDA_ARCHITECTURES=${CUDF_CMAKE_CUDA_ARCHITECTURES} ${EXTRA_CMAKE_ARGS} -- -j${PARALLEL_LEVEL:-1}
if [[ ${INSTALL_TARGET} != "" ]]; then
python setup.py install --single-version-externally-managed --record=record.txt -- -DCMAKE_PREFIX_PATH=${INSTALL_PREFIX} -DCMAKE_LIBRARY_PATH=${LIBCUDF_BUILD_DIR} ${EXTRA_CMAKE_ARGS} -- -j${PARALLEL_LEVEL:-1}
fi
fi

# Build and install the dask_cudf Python package
if buildAll || hasArg dask_cudf; then
Expand Down
7 changes: 1 addition & 6 deletions ci/build_python.sh
@@ -1,5 +1,5 @@
#!/bin/bash
# Copyright (c) 2022, NVIDIA CORPORATION.
# Copyright (c) 2022-2023, NVIDIA CORPORATION.

set -euo pipefail

@@ -38,10 +38,5 @@ rapids-mamba-retry mambabuild \
--channel "${RAPIDS_CONDA_BLD_OUTPUT_DIR}" \
conda/recipes/custreamz

rapids-mamba-retry mambabuild \
--no-test \
--channel "${CPP_CHANNEL}" \
--channel "${RAPIDS_CONDA_BLD_OUTPUT_DIR}" \
conda/recipes/strings_udf

rapids-upload-conda-to-s3 python
2 changes: 0 additions & 2 deletions ci/release/update-version.sh
@@ -40,8 +40,6 @@ sed_runner 's/'"VERSION ${CURRENT_SHORT_TAG}.*"'/'"VERSION ${NEXT_FULL_TAG}"'/g'
# Python update
sed_runner 's/'"cudf_version .*)"'/'"cudf_version ${NEXT_FULL_TAG})"'/g' python/cudf/CMakeLists.txt

# Strings UDF update
sed_runner 's/'"strings_udf_version .*)"'/'"strings_udf_version ${NEXT_FULL_TAG})"'/g' python/strings_udf/CMakeLists.txt

# cpp libcudf_kafka update
sed_runner 's/'"VERSION ${CURRENT_SHORT_TAG}.*"'/'"VERSION ${NEXT_FULL_TAG}"'/g' cpp/libcudf_kafka/CMakeLists.txt
Expand Down
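The `sed_runner` lines in this script rewrite a version string in place. As a rough illustration of what that substitution does, here is a minimal Python sketch. The helper name `bump_version`, the sample CMake line, and the version numbers are hypothetical and not taken from the repository; only the pattern shape (`<key> .*)` replaced by `<key> <new tag>)`) mirrors the `sed` expression above.

```python
import re


def bump_version(text: str, key: str, next_full_tag: str) -> str:
    """Replace '<key> <anything>)' with '<key> <next_full_tag>)'.

    Mirrors the sed expression s/<key> .*)/<key> <tag>)/g. Note that ')' is
    a literal in sed's basic regex syntax but must be escaped in Python.
    """
    pattern = re.compile(rf"{re.escape(key)} .*\)")
    # A callable replacement avoids any backslash processing of the tag.
    return pattern.sub(lambda m: f"{key} {next_full_tag})", text)


# Hypothetical input line, used only for illustration.
cmake_line = "set(cudf_version 23.04.00)"
print(bump_version(cmake_line, "cudf_version", "23.06.00"))
```

Because `.` does not match a newline by default, the substitution stays within a single line, as it does with `sed`. With the `strings_udf` package gone, only the `cudf_version` substitution remains for the Python side.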
37 changes: 0 additions & 37 deletions ci/test_python_other.sh
@@ -44,41 +44,4 @@ pytest \
custreamz
popd

set -e
rapids-mamba-retry install \
--channel "${CPP_CHANNEL}" \
--channel "${PYTHON_CHANNEL}" \
strings_udf
set +e

rapids-logger "pytest strings_udf"
pushd python/strings_udf/strings_udf
pytest \
--cache-clear \
--junitxml="${RAPIDS_TESTS_DIR}/junit-strings-udf.xml" \
--numprocesses=8 \
--dist=loadscope \
--cov-config=.coveragerc \
--cov=strings_udf \
--cov-report=xml:"${RAPIDS_COVERAGE_DIR}/strings-udf-coverage.xml" \
--cov-report=term \
tests
popd

rapids-logger "pytest cudf with strings_udf"
pushd python/cudf/cudf
pytest \
--cache-clear \
--ignore="benchmarks" \
--junitxml="${RAPIDS_TESTS_DIR}/junit-cudf-strings-udf.xml" \
--numprocesses=8 \
--dist=loadscope \
--cov-config=../.coveragerc \
--cov=cudf \
--cov-report=xml:"${RAPIDS_COVERAGE_DIR}/cudf-strings-udf-coverage.xml" \
--cov-report=term \
tests/test_udf_masked_ops.py
popd

rapids-logger "Test script exiting with value: $EXITCODE"
exit ${EXITCODE}
4 changes: 0 additions & 4 deletions conda/recipes/strings_udf/build.sh

This file was deleted.

14 changes: 0 additions & 14 deletions conda/recipes/strings_udf/conda_build_config.yaml

This file was deleted.

78 changes: 0 additions & 78 deletions conda/recipes/strings_udf/meta.yaml

This file was deleted.

2 changes: 1 addition & 1 deletion python/cudf/CMakeLists.txt
@@ -120,7 +120,7 @@ endif()
rapids_cython_init()

add_subdirectory(cudf/_lib)
add_subdirectory(udf_cpp/groupby)
add_subdirectory(udf_cpp)

include(cmake/Modules/ProtobufHelpers.cmake)
codegen_protoc(cudf/utils/metadata/orc_column_statistics.proto)
Expand Down
5 changes: 4 additions & 1 deletion python/cudf/cudf/_lib/CMakeLists.txt
@@ -1,5 +1,5 @@
# =============================================================================
# Copyright (c) 2022, NVIDIA CORPORATION.
# Copyright (c) 2022-2023, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
@@ -46,6 +46,7 @@ set(cython_sources
sort.pyx
stream_compaction.pyx
string_casting.pyx
strings_udf.pyx
text.pyx
transform.pyx
transpose.pyx
@@ -61,6 +62,8 @@ rapids_cython_create_modules(
LINKED_LIBRARIES "${linked_libraries}" ASSOCIATED_TARGETS cudf
)

target_link_libraries(strings_udf cudf_strings_udf)

# TODO: Finding NumPy currently requires finding Development due to a bug in CMake. This bug was
# fixed in https://gitlab.kitware.com/cmake/cmake/-/merge_requests/7410 and will be available in
# CMake 3.24, so we can remove the Development component once we upgrade to CMake 3.24.
Expand Down
3 changes: 2 additions & 1 deletion python/cudf/cudf/_lib/__init__.py
@@ -1,4 +1,4 @@
# Copyright (c) 2020-2022, NVIDIA CORPORATION.
# Copyright (c) 2020-2023, NVIDIA CORPORATION.
import numpy as np

from . import (
@@ -33,6 +33,7 @@
stream_compaction,
string_casting,
strings,
strings_udf,
text,
transpose,
unary,
Expand Down
Original file line number Diff line number Diff line change
@@ -1,14 +1,15 @@
# Copyright (c) 2022, NVIDIA CORPORATION.
# Copyright (c) 2022-2023, NVIDIA CORPORATION.

from libc.stdint cimport uint8_t, uint16_t
from libcpp.memory cimport unique_ptr
from libcpp.string cimport string
from libcpp.vector cimport vector

from rmm._lib.device_buffer cimport DeviceBuffer, device_buffer

from cudf._lib.cpp.column.column cimport column
from cudf._lib.cpp.column.column_view cimport column_view
from cudf._lib.cpp.types cimport size_type
from rmm._lib.device_buffer cimport DeviceBuffer, device_buffer


cdef extern from "cudf/strings/udf/udf_string.hpp" namespace \
Expand Down
Original file line number Diff line number Diff line change
@@ -1,15 +1,25 @@
# Copyright (c) 2022, NVIDIA CORPORATION.
# Copyright (c) 2022-2023, NVIDIA CORPORATION.

from libc.stdint cimport uint8_t, uint16_t, uintptr_t

from cudf._lib.cpp.strings_udf cimport (
get_character_cases_table as cpp_get_character_cases_table,
get_character_flags_table as cpp_get_character_flags_table,
get_special_case_mapping_table as cpp_get_special_case_mapping_table,
)

import numpy as np

from libcpp.memory cimport unique_ptr
from libcpp.utility cimport move

from cudf.core.buffer import as_buffer

from cudf._lib.column cimport Column
from cudf._lib.cpp.column.column cimport column, column_view
from rmm._lib.device_buffer cimport DeviceBuffer, device_buffer

from strings_udf._lib.cpp.strings_udf cimport (
from cudf._lib.column cimport Column
from cudf._lib.cpp.column.column cimport column, column_view
from cudf._lib.cpp.strings_udf cimport (
column_from_udf_string_array as cpp_column_from_udf_string_array,
free_udf_string_array as cpp_free_udf_string_array,
to_string_view_array as cpp_to_string_view_array,
@@ -39,3 +49,18 @@ def column_from_udf_string_array(DeviceBuffer d_buffer):
result = Column.from_unique_ptr(move(c_result))

return result


def get_character_flags_table_ptr():
cdef const uint8_t* tbl_ptr = cpp_get_character_flags_table()
return np.uintp(<uintptr_t>tbl_ptr)


def get_character_cases_table_ptr():
cdef const uint16_t* tbl_ptr = cpp_get_character_cases_table()
return np.uintp(<uintptr_t>tbl_ptr)


def get_special_case_mapping_table_ptr():
cdef const void* tbl_ptr = cpp_get_special_case_mapping_table()
return np.uintp(<uintptr_t>tbl_ptr)
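The three `get_*_table_ptr` functions added above each return a raw address wrapped in `np.uintp`, so the caller receives a plain pointer-sized integer rather than a typed object. The sketch below illustrates that pattern with `ctypes` and a small host-memory array standing in for the real tables. The table contents and the helper name `get_table_ptr` are made up for illustration; nothing here calls into libcudf.

```python
import ctypes

import numpy as np

# Host-memory stand-in for a lookup table (values are arbitrary).
flags_table = (ctypes.c_uint8 * 4)(10, 20, 30, 40)


def get_table_ptr(table) -> np.uintp:
    """Return the table's address as a pointer-sized unsigned integer."""
    return np.uintp(ctypes.addressof(table))


ptr = get_table_ptr(flags_table)

# The integer carries no type information, so the consumer has to
# reinterpret it with the element type it expects.
view = ctypes.cast(int(ptr), ctypes.POINTER(ctypes.c_uint8))
values = [view[i] for i in range(len(flags_table))]
print(values)
```

Passing the address as an integer keeps the Python-facing API free of C types, at the cost of leaving the element type (here `uint8_t`, `uint16_t`, or an opaque `void*`) as a convention between producer and consumer.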
7 changes: 2 additions & 5 deletions python/cudf/cudf/core/dataframe.py
@@ -4192,11 +4192,8 @@ def apply(
For more information, see the `cuDF guide to user defined functions
<https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html>`__.

Support for use of string data within UDFs is provided through the
`strings_udf <https://anaconda.org/rapidsai-nightly/strings_udf>`__
RAPIDS library. Supported operations on strings include the subset of
functions and string methods that expect an input string but do not
return a string. Refer to caveats in the UDF guide referenced above.
Some string functions and methods are supported. Refer to the guide
to UDFs for details.

Parameters
----------
Expand Down
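The revised docstring says only that some string functions and methods are supported inside UDFs. As a hedged sketch of what such a row-wise UDF might look like, the function below uses `startswith` and `len` on a string field. The column names `name` and `bonus` and the function `name_score` are hypothetical, and which string operations are actually supported should be checked against the UDF guide linked in the docstring. The function is exercised here with plain dictionaries so it runs without a GPU; the cuDF call is shown only as a comment.

```python
def name_score(row):
    """Row-wise UDF: add the name's length to the bonus if it starts with 'a'."""
    st = row["name"]
    if st.startswith("a"):
        return len(st) + row["bonus"]
    return row["bonus"]


# Plain-Python check, using dicts in place of DataFrame rows.
rows = [{"name": "apple", "bonus": 1}, {"name": "kiwi", "bonus": 2}]
scores = [name_score(r) for r in rows]
print(scores)

# With cuDF installed and a GPU available (assumption, not executed here):
# import cudf
# df = cudf.DataFrame({"name": ["apple", "kiwi"], "bonus": [1, 2]})
# result = df.apply(name_score, axis=1)
```

After this PR the string support ships as part of `cudf` itself, so no separate `strings_udf` package should be needed for a UDF of this kind.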
2 changes: 0 additions & 2 deletions python/cudf/cudf/core/indexed_frame.py
@@ -2112,7 +2112,6 @@ def _apply(self, func, kernel_getter, *args, **kwargs):
"""Apply `func` across the rows of the frame."""
if kwargs:
raise ValueError("UDFs using **kwargs are not yet supported.")

try:
kernel, retty = _compile_or_get(
self, func, args, kernel_getter=kernel_getter
@@ -2130,7 +2129,6 @@ def _apply(self, func, kernel_getter, *args, **kwargs):
output_args = [(ans_col, ans_mask), len(self)]
input_args = _get_input_args_from_frame(self)
launch_args = output_args + input_args + list(args)

try:
kernel.forall(len(self))(*launch_args)
except Exception as e:
Expand Down
7 changes: 2 additions & 5 deletions python/cudf/cudf/core/series.py
@@ -2298,11 +2298,8 @@ def apply(self, func, convert_dtype=True, args=(), **kwargs):
For more information, see the `cuDF guide to user defined functions
<https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html>`__.

Support for use of string data within UDFs is provided through the
`strings_udf <https://anaconda.org/rapidsai-nightly/strings_udf>`__
RAPIDS library. Supported operations on strings include the subset of
functions and string methods that expect an input string but do not
return a string. Refer to caveats in the UDF guide referenced above.
Some string functions and methods are supported. Refer to the guide
to UDFs for details.

Parameters
----------
Expand Down