Add read-only functions on string dtypes to DataFrame.apply and Series.apply #11319

Merged
merged 245 commits into from
Sep 20, 2022
Changes from 125 commits
84592ee
plumbed but returning zeros
brandon-b-miller Feb 3, 2022
f18f03a
up and running
brandon-b-miller Feb 4, 2022
0921c8f
move code around
brandon-b-miller Feb 10, 2022
2aa6299
remove old file
brandon-b-miller Feb 10, 2022
38a1c44
bugfixes and a temporary workaround
brandon-b-miller Feb 11, 2022
b7d6ba2
Merge branch 'branch-22.04' into string_udfs
brandon-b-miller Feb 15, 2022
2d0b505
successfully casting literal->masked string
brandon-b-miller Feb 23, 2022
a3e0b7f
startswith lowering
brandon-b-miller Feb 23, 2022
2dc33f7
Merge branch 'branch-22.04' into string_udfs
brandon-b-miller Feb 28, 2022
087a9f2
fix bugs in startswith, which now works
brandon-b-miller Feb 28, 2022
2930e75
add typing and lowering for ends_with
brandon-b-miller Mar 8, 2022
f190d43
implement rfind
brandon-b-miller Mar 8, 2022
b056f41
add some basic testing
brandon-b-miller Mar 8, 2022
06596c0
fix bug
brandon-b-miller Mar 8, 2022
ad983f8
run black
brandon-b-miller Mar 8, 2022
9eda436
adjust and document StringViewArgHandler
brandon-b-miller Mar 9, 2022
bd56aaa
adjust doc
brandon-b-miller Mar 9, 2022
d12b6da
move things around
brandon-b-miller Mar 9, 2022
f2a72f0
improvements to struct modeling logic
brandon-b-miller Mar 9, 2022
426820f
fix caching issue
brandon-b-miller Mar 9, 2022
132e6b8
struct size is only 16?
brandon-b-miller Mar 9, 2022
22ce467
minor update
brandon-b-miller Mar 9, 2022
92df351
merge latest
brandon-b-miller Apr 6, 2022
c6fbdcf
fix bug introduced by bad merge
brandon-b-miller Apr 6, 2022
1e9dec3
partial implementation for returning strings
brandon-b-miller Apr 6, 2022
dd416a2
returning a string is working
brandon-b-miller Apr 8, 2022
0ebcd07
remove stale breakpoint
brandon-b-miller Apr 8, 2022
297070a
a wrong and not working, but informative attempt at upper()
brandon-b-miller Apr 11, 2022
e3456de
returning dstrings instead of string_views
brandon-b-miller Apr 11, 2022
55abe5e
fully convert to dstring
brandon-b-miller Apr 12, 2022
1dbd427
tests and plumbing for upper
brandon-b-miller Apr 12, 2022
f030f90
plumbed lower
brandon-b-miller Apr 12, 2022
c05bd6d
merge latest
brandon-b-miller Apr 13, 2022
4c95fc3
partial getitem implementation for ints, not working yet
brandon-b-miller Apr 13, 2022
f09eceb
substring plumbing, also not working (for the same reason as at)
brandon-b-miller Apr 13, 2022
b0fb8d2
substr RUNNING but different results from pandas...
brandon-b-miller Apr 14, 2022
3691c89
add strip, rstrip, and lstrip
brandon-b-miller Apr 14, 2022
66e92e2
working through +, not working yet
brandon-b-miller Apr 18, 2022
b1a8588
Merge branch 'branch-22.06' into string_udfs
brandon-b-miller May 13, 2022
1cf26a4
plumb out to stringudfs
brandon-b-miller May 23, 2022
e61815d
plumbing startswith and endswith back in
brandon-b-miller May 26, 2022
47b07e8
Merge branch 'branch-22.08' into string_udfs
brandon-b-miller May 31, 2022
19f491a
updates from strings_udf/main
brandon-b-miller Jun 15, 2022
adad2bd
Merge branch 'branch-22.08' into string_udfs
brandon-b-miller Jul 6, 2022
9897a2b
Merge branch 'branch-22.08' into string_udfs
brandon-b-miller Jul 12, 2022
bca8fe8
plumb back in several more functions
brandon-b-miller Jul 12, 2022
f165547
support comparison operators between strings. Reuses existing lowering?
brandon-b-miller Jul 14, 2022
551150e
Merge branch 'branch-22.08' into string_udfs
brandon-b-miller Jul 19, 2022
a232a20
first attempt at adding strings_udf as a subpackage
brandon-b-miller Jul 19, 2022
f9f0ac2
add other _is_ functions
brandon-b-miller Jul 20, 2022
35d238b
updates
brandon-b-miller Jul 20, 2022
cc661b3
strings_udf lowering shall not import cudf, and functions shall be at…
brandon-b-miller Jul 21, 2022
3710d9b
Fix style checks for string_udfs.
bdice Jul 21, 2022
5602da8
Merge pull request #1 from bdice/string_udfs-style
brandon-b-miller Jul 22, 2022
c0d212c
Merge branch 'branch-22.08' into string_udfs
brandon-b-miller Jul 22, 2022
2442bf2
configure strings_udf to be optional based on a runtime check
brandon-b-miller Jul 27, 2022
0b35e16
remove stake breakpoint
brandon-b-miller Jul 27, 2022
2a010f4
skip string tests if not enabled
brandon-b-miller Aug 1, 2022
8bd7773
add a build script
brandon-b-miller Aug 1, 2022
92229b7
add build script to main build script
brandon-b-miller Aug 1, 2022
6058a2b
merge latest
brandon-b-miller Aug 1, 2022
1d868b4
pass style checks, move functions to lowering
brandon-b-miller Aug 2, 2022
df05054
Merge branch 'branch-22.10' into string_udfs
brandon-b-miller Aug 3, 2022
c8aa6f8
more wrangling with style checks, somehow
brandon-b-miller Aug 3, 2022
8635f28
last try
brandon-b-miller Aug 3, 2022
d208c5b
Fix isort.
bdice Aug 3, 2022
022f820
Merge pull request #2 from bdice/isort-fix
brandon-b-miller Aug 3, 2022
60e60be
Proposed isort changes.
bdice Aug 3, 2022
05fefbd
Merge pull request #3 from bdice/isort-fix
brandon-b-miller Aug 3, 2022
fb3820f
allow import on CPU only machine
brandon-b-miller Aug 3, 2022
eed2a52
catch importerror
brandon-b-miller Aug 3, 2022
8d4d10b
updates
brandon-b-miller Aug 3, 2022
099a1fc
do not import stringview in masked constructor
brandon-b-miller Aug 3, 2022
26be2a0
start adding strings_udf to ci test scripts
brandon-b-miller Aug 3, 2022
6df94b7
excize string udf imports in masked typing
brandon-b-miller Aug 4, 2022
8ac7455
excise string_udf imports from row_function.py
brandon-b-miller Aug 4, 2022
77d19b1
remove license
brandon-b-miller Aug 5, 2022
6095fa9
Merge branch 'branch-22.10' into string_udfs
brandon-b-miller Aug 8, 2022
938f017
begin adding and testing conda package
brandon-b-miller Aug 8, 2022
4119b6d
conda and build updates
brandon-b-miller Aug 9, 2022
d6e043b
add cython to meta.yaml
brandon-b-miller Aug 9, 2022
2cbf750
try yaml from cudf
brandon-b-miller Aug 9, 2022
53dd3f6
add conda_build_config.yaml
brandon-b-miller Aug 9, 2022
436f16f
use skbuild
brandon-b-miller Aug 10, 2022
293652d
update meta.yaml
brandon-b-miller Aug 10, 2022
43a11aa
build using build system independent commands
brandon-b-miller Aug 11, 2022
2733360
change around cuda version constraints, adjust cmake to ship ptx hope…
brandon-b-miller Aug 15, 2022
501cabb
Merge branch 'branch-22.10' into string_udfs
brandon-b-miller Aug 15, 2022
4273949
prune out dstring until phase 2 and CLI files for now
brandon-b-miller Aug 16, 2022
196f0dd
add back mistakenly removed udf_apis.hpp, which is required
brandon-b-miller Aug 16, 2022
241fd3b
add back in just the needed parts of udf_apis, resolve cython errors
brandon-b-miller Aug 16, 2022
0912904
Merge branch 'branch-22.10' into string_udfs
brandon-b-miller Aug 17, 2022
71e5cab
fix _get_frame_row_type struct alignment issues
brandon-b-miller Aug 17, 2022
24d6022
move PTX to conda prefix
brandon-b-miller Aug 17, 2022
996d944
minor update
brandon-b-miller Aug 17, 2022
2b669de
create skbuild process based on the one from cudf
brandon-b-miller Aug 19, 2022
8324831
run pre-commit
brandon-b-miller Aug 19, 2022
08904f7
install versioneer into subpackage
brandon-b-miller Aug 22, 2022
7e985d3
ignore versioneer in all subpackages
brandon-b-miller Aug 22, 2022
1406d88
link to libcudf_strings_udf during cython build
brandon-b-miller Aug 22, 2022
29bc117
small fixes
brandon-b-miller Aug 22, 2022
2aafe03
update properties of setup.cfg and rerun pre-commit
brandon-b-miller Aug 22, 2022
d2bc899
pyarrow 8->9 in meta.yaml
brandon-b-miller Aug 22, 2022
1a5341a
updates from local CI testing
brandon-b-miller Aug 24, 2022
b293eb6
simplify setup.py
brandon-b-miller Aug 24, 2022
7df7172
no need to rebuild strings_udf on test nodes, just install from packa…
brandon-b-miller Aug 30, 2022
475f62a
build strings_udf conda package on cpu_build jobs
brandon-b-miller Aug 30, 2022
0460abc
build both 3.8 and 3.9 conda packages on CPU only, and add debugging …
brandon-b-miller Aug 31, 2022
1b307b9
restrict strings_udf based on version of CUDA originally used to comp…
brandon-b-miller Aug 31, 2022
d847ad8
merge and resolve conflicts
brandon-b-miller Sep 2, 2022
67bb0a7
small cleanups
brandon-b-miller Sep 2, 2022
b9acab9
prune files and update to strings_udf repo
brandon-b-miller Sep 2, 2022
3852d79
format c++ files
brandon-b-miller Sep 2, 2022
30b1a45
Update python/strings_udf/cpp/include/cudf/strings/udf/char_types.cuh
davidwendt Sep 2, 2022
6a51daa
Apply suggestions from code review
brandon-b-miller Sep 6, 2022
fe1cddc
address reviews
brandon-b-miller Sep 6, 2022
8b2c1e7
merge test files
brandon-b-miller Sep 6, 2022
f19ead6
Apply suggestions from code review
brandon-b-miller Sep 7, 2022
b3eeb9f
prune cython files
brandon-b-miller Sep 7, 2022
c4b4c7b
prune imports and correct use of gil
brandon-b-miller Sep 7, 2022
9231469
define size_type = types.int32 and use throughout strings_udf and cud…
brandon-b-miller Sep 7, 2022
985572f
move functions around in tests/utils.py
brandon-b-miller Sep 7, 2022
bfcb0ec
add tests and bindings for MaskedType(string_view).count()
brandon-b-miller Sep 7, 2022
7958150
move .clang-format to repo root
brandon-b-miller Sep 7, 2022
1183e8c
search for cuda version using re
brandon-b-miller Sep 7, 2022
6e152ff
Apply suggestions from code review
brandon-b-miller Sep 8, 2022
66cbc3d
address reviews
brandon-b-miller Sep 8, 2022
b99ddd6
Update count() logic
davidwendt Sep 8, 2022
11a35d7
add more parenthesis
davidwendt Sep 9, 2022
1f78b05
remove blank line
davidwendt Sep 9, 2022
0d3f761
include install guard in build.sh
brandon-b-miller Sep 9, 2022
c3e174c
revert changes to update-version.sh
brandon-b-miller Sep 9, 2022
2bca928
prune meta.yaml
brandon-b-miller Sep 9, 2022
35e159e
delete strings_udf from namespace if not enabled
brandon-b-miller Sep 9, 2022
25c4a7c
Apply suggestions from code review
brandon-b-miller Sep 9, 2022
e799232
Merge branch 'string_udfs' of github.com:brandon-b-miller/cudf into s…
brandon-b-miller Sep 9, 2022
99b7f3b
use a decorator to create lowerings dynamically
brandon-b-miller Sep 9, 2022
c44c575
dynamically generate typing
brandon-b-miller Sep 9, 2022
cf118e9
renaming
brandon-b-miller Sep 9, 2022
54a0013
Update python/strings_udf/strings_udf/_lib/cudf_jit_udf.pyx
brandon-b-miller Sep 9, 2022
c3f1968
clean tables.pyx
brandon-b-miller Sep 9, 2022
2dc437b
relink imports from strings_udf library
brandon-b-miller Sep 9, 2022
a64a745
dynamically generate attributes of MaskedType(string_view)
brandon-b-miller Sep 9, 2022
8938f1b
dynamically generated lowerings for MaskedType(string_view)
brandon-b-miller Sep 9, 2022
c4438c5
simplify test suite using fixtures
brandon-b-miller Sep 9, 2022
1faf49d
use a fixture in test_string_udfs
brandon-b-miller Sep 9, 2022
c0bd0f5
remove comments in shim.cu
brandon-b-miller Sep 9, 2022
80b176e
prune setup.py
brandon-b-miller Sep 9, 2022
67fada7
address reviews in __init__.py
brandon-b-miller Sep 9, 2022
45396fb
more __init__ updates
brandon-b-miller Sep 9, 2022
7489e7f
update test utils in strings_udf library
brandon-b-miller Sep 9, 2022
49db996
merge test utils into test_string_udfs.py
brandon-b-miller Sep 9, 2022
95b6f5e
Change C++ strings_udf code to be built as part of the Python code.
vyasr Sep 9, 2022
6272327
Simplify CUDA handling.
vyasr Sep 9, 2022
d003e85
Reorganize to clearly separate the C++ functionality from the shim code.
vyasr Sep 9, 2022
cc01707
Install the C++ lib.
vyasr Sep 9, 2022
10f3379
Compile shims for every architecture.
vyasr Sep 9, 2022
d204a06
Minor cleanup.
vyasr Sep 9, 2022
6416d9d
Merge branch 'string_udfs' of github.com:brandon-b-miller/cudf into s…
vyasr Sep 9, 2022
3a9be14
remove cpp build step from build.sh
brandon-b-miller Sep 10, 2022
8189f71
Remove unnecessary files.
vyasr Sep 10, 2022
138dd9a
Define some more fixtures.
vyasr Sep 10, 2022
9eb3bfd
Simplify binary func lowering.
vyasr Sep 10, 2022
f0d3c90
Pass signature params directly.
vyasr Sep 10, 2022
2bce019
Define alias for CPointer to string_view.
vyasr Sep 10, 2022
5fa1ccd
Some more boilerplate reduction.
vyasr Sep 10, 2022
3db5b91
Remove one more unnecessary file.
vyasr Sep 10, 2022
acadf53
Remove extra new line.
vyasr Sep 10, 2022
c945aa1
Minor CMake cleanup.
vyasr Sep 10, 2022
a8b6cf4
update copyright
davidwendt Sep 12, 2022
17598d6
fix small bugs in cuda function declarations
brandon-b-miller Sep 12, 2022
7c55deb
rework __init__ such that a CPU machine may import strings_udf
brandon-b-miller Sep 12, 2022
02f93d4
Merge branch 'string_udfs' of github.com:brandon-b-miller/cudf into s…
brandon-b-miller Sep 12, 2022
5ab660d
Fix return calculation for zero-length target in count()
davidwendt Sep 12, 2022
6f67b69
find and rfind return size_type
brandon-b-miller Sep 12, 2022
3be55cb
add ptx files in the strings_udf directory to .gitignore
brandon-b-miller Sep 12, 2022
b6f3482
Merge branch 'string_udfs' of github.com:brandon-b-miller/cudf into s…
brandon-b-miller Sep 12, 2022
9aaea65
Fix style check
davidwendt Sep 12, 2022
4ed60fb
pass rather than raise if no strings_udf
brandon-b-miller Sep 12, 2022
dfd819e
Merge branch 'string_udfs' of github.com:brandon-b-miller/cudf into s…
brandon-b-miller Sep 12, 2022
3acc721
skip strings_udf tests in library if not enabled
brandon-b-miller Sep 12, 2022
b70e5a0
debug print to find why files are not being found
brandon-b-miller Sep 12, 2022
325abb6
Run tests from inside the strings_udf directory.
vyasr Sep 12, 2022
e145105
Remove __init__ file that could be causing file traversal.
vyasr Sep 12, 2022
12abaf0
Add extra debug print.
vyasr Sep 12, 2022
3fbc85b
Clean up debug prints and remove unused patch_needed import.
vyasr Sep 13, 2022
52e686c
only add object to types if strings_udf is enabled
brandon-b-miller Sep 13, 2022
6fbb0d4
Merge branch 'string_udfs' of github.com:brandon-b-miller/cudf into s…
brandon-b-miller Sep 13, 2022
faf4878
add strings_udf to cudf conda recipe
brandon-b-miller Sep 13, 2022
36f41f6
cleanup
brandon-b-miller Sep 13, 2022
7ac5b7e
Remove cudf from strings_udf requirements.
vyasr Sep 13, 2022
6549668
go back to an optional dependency
brandon-b-miller Sep 13, 2022
a7afe7a
fix yaml error
brandon-b-miller Sep 13, 2022
69813de
dont directly import ptxpath in tests
brandon-b-miller Sep 13, 2022
18bd4bc
Enable flake8 again.
vyasr Sep 13, 2022
c71c2dd
Don't return a nonzero exit status when strings_udf tests are all ski…
vyasr Sep 13, 2022
5b585c7
Simplify declaration of lowering functions.
vyasr Sep 13, 2022
b86d44e
retest with strings_udf after initial tests
brandon-b-miller Sep 13, 2022
dadbaa4
Use more fixtures for the test.
vyasr Sep 13, 2022
243aa7a
Minor CMake cleanup.
vyasr Sep 13, 2022
5514b3e
Typo fixes.
vyasr Sep 13, 2022
713f23d
Switch prefix with postfix.
vyasr Sep 13, 2022
f324ebc
Some copyright fixes.
vyasr Sep 13, 2022
313757c
Remove unnecessary cmdclass modification.
vyasr Sep 13, 2022
fd1041a
Improve PTX file handling and error cases.
vyasr Sep 13, 2022
0e177de
Fix issue with rapids-cmake changing the variable name.
vyasr Sep 13, 2022
267c5af
More copyright fixes.
vyasr Sep 13, 2022
5bb80ff
Remove unnecessary temp variable.
vyasr Sep 13, 2022
a16a83d
Merge branch 'string_udfs' of github.com:brandon-b-miller/cudf into s…
vyasr Sep 13, 2022
b30ca2f
Add missing @param
davidwendt Sep 13, 2022
7c1851c
Update python/cudf/cudf/core/udf/row_function.py
brandon-b-miller Sep 14, 2022
5ca6abc
Remove extraneous include.
vyasr Sep 14, 2022
356135b
Merge branch 'string_udfs' of github.com:brandon-b-miller/cudf into s…
vyasr Sep 14, 2022
c0ce94a
Move to_string_view_array into the appropriate namespaces.
vyasr Sep 14, 2022
0867205
Add missing nogil and except +.
vyasr Sep 14, 2022
c4f9133
Use auto.
vyasr Sep 14, 2022
23b3020
Add consts.
vyasr Sep 14, 2022
51ac693
Commit missed detail header with detail::to_string_view_array declara…
vyasr Sep 14, 2022
84b4d4b
Add missing stream parameter.
vyasr Sep 14, 2022
f2e52cf
fix bugs with imported implementation names
brandon-b-miller Sep 14, 2022
2f47e9f
Remove header exposing detail API.
vyasr Sep 14, 2022
b1a5fa6
Merge branch 'string_udfs' of github.com:brandon-b-miller/cudf into s…
brandon-b-miller Sep 14, 2022
ab87921
Fix style.
vyasr Sep 14, 2022
07dbded
address other reviews
brandon-b-miller Sep 14, 2022
c3b0221
Merge branch 'string_udfs' of github.com:brandon-b-miller/cudf into s…
brandon-b-miller Sep 14, 2022
c1f686d
refactor _typing and strings_typing
brandon-b-miller Sep 14, 2022
7975606
style
brandon-b-miller Sep 14, 2022
6080e25
import funcs from strings_udf
brandon-b-miller Sep 14, 2022
b2062ea
fix len test
brandon-b-miller Sep 14, 2022
f964da4
add more doxygen to count()
davidwendt Sep 14, 2022
2e19a66
Remove unneeded header includes
davidwendt Sep 14, 2022
b7ad6c9
Update python/strings_udf/strings_udf/tests/test_string_udfs.py
brandon-b-miller Sep 14, 2022
345e417
Update python/cudf/cudf/tests/test_udf_masked_ops.py
brandon-b-miller Sep 14, 2022
b8fe76f
docs
brandon-b-miller Sep 14, 2022
c2b1267
elaborate on table_ptr
brandon-b-miller Sep 14, 2022
1f8cb3b
use in ci/cpu/build.sh
brandon-b-miller Sep 14, 2022
78e1078
add strings_udf to upload.sh
brandon-b-miller Sep 14, 2022
24090de
merge 22.10
brandon-b-miller Sep 19, 2022
db0c589
revert to building package for different python versions manually
brandon-b-miller Sep 19, 2022
b88def9
merge 22.10 again
brandon-b-miller Sep 19, 2022
cf0ae5b
Explicitly catch return code instead of relying on a function.
vyasr Sep 19, 2022
15aacc2
merge latest from 22.10
brandon-b-miller Sep 20, 2022
cda1415
Merge branch 'string_udfs' of github.com:brandon-b-miller/cudf into s…
brandon-b-miller Sep 20, 2022
40e6cd6
style
brandon-b-miller Sep 20, 2022
036eeec
cudf/core/udf/__init__.py is sensitive to import order, skip on isort
brandon-b-miller Sep 20, 2022
File renamed without changes.
1 change: 1 addition & 0 deletions .gitattributes
@@ -1,2 +1,3 @@
python/cudf/cudf/_version.py export-subst
python/strings_udf/strings_udf/_version.py export-subst
CHANGELOG.md merge=union
1 change: 1 addition & 0 deletions .gitignore
@@ -35,6 +35,7 @@ python/cudf_kafka/*/_lib/**/*\.cpp
python/cudf_kafka/*/_lib/**/*.h
python/custreamz/*/_lib/**/*\.cpp
python/custreamz/*/_lib/**/*.h
python/strings_udf/strings_udf/_lib/*.cpp
.Python
env/
develop-eggs/
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -25,6 +25,7 @@ repos:
args: ["--config=setup.cfg"]
files: python/.*\.(py|pyx|pxd)$
types: [file]
exclude: python/.*/versioneer.py
- repo: https://github.com/pre-commit/mirrors-mypy
rev: 'v0.782'
hooks:
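The effect of the new `exclude` entry can be checked with a short Python sketch. It assumes pre-commit applies `files` and `exclude` as Python regular expressions with search semantics; the two patterns are copied from the hunk above, while `is_checked` and the candidate paths are hypothetical and only for illustration.

```python
import re

# Patterns copied from the hook configuration in the hunk above.
FILES = re.compile(r"python/.*\.(py|pyx|pxd)$")
EXCLUDE = re.compile(r"python/.*/versioneer.py")


def is_checked(path: str) -> bool:
    """Return True if the hook would run on `path` (hypothetical helper).

    Assumes search semantics: a file is selected when `files` matches
    and `exclude` does not.
    """
    return bool(FILES.search(path)) and not EXCLUDE.search(path)


# Illustrative paths, not taken from the PR.
paths = [
    "python/cudf/cudf/core/udf/utils.py",
    "python/strings_udf/versioneer.py",
    "python/cudf/versioneer.py",
    "python/strings_udf/strings_udf/_lib/tables.pyx",
]
checked = [p for p in paths if is_checked(p)]
```

Under these assumptions, every `versioneer.py` one directory or more below `python/` is skipped, which is consistent with the commit "ignore versioneer in all subpackages".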
17 changes: 16 additions & 1 deletion build.sh
@@ -17,7 +17,7 @@ ARGS=$*
# script, and that this script resides in the repo dir!
REPODIR=$(cd $(dirname $0); pwd)

VALIDARGS="clean libcudf cudf cudfjar dask_cudf benchmarks tests libcudf_kafka cudf_kafka custreamz -v -g -n -l --allgpuarch --disable_nvtx --opensource_nvcomp --show_depr_warn --ptds -h --build_metrics --incl_cache_stats"
VALIDARGS="clean libcudf cudf cudfjar dask_cudf benchmarks tests libcudf_kafka cudf_kafka custreamz strings_udf -v -g -n -l --allgpuarch --disable_nvtx --opensource_nvcomp --show_depr_warn --ptds -h --build_metrics --incl_cache_stats"
HELP="$0 [clean] [libcudf] [cudf] [cudfjar] [dask_cudf] [benchmarks] [tests] [libcudf_kafka] [cudf_kafka] [custreamz] [-v] [-g] [-n] [-h] [--cmake-args=\\\"<args>\\\"]
clean - remove all existing build artifacts and configuration (start
over)
@@ -335,6 +335,21 @@ if buildAll || hasArg cudf; then
fi
fi

if buildAll || hasArg strings_udf; then
# do not separately expose strings_udf c++ library
# always build python and c++ at the same time and include into the same conda package
cd ${REPODIR}/python/strings_udf/cpp
ls
cmake -S ./ -B build -DCONDA_PREFIX=${INSTALL_PREFIX} -DCMAKE_INSTALL_PREFIX=${INSTALL_PREFIX}/
cmake --build build
cmake --install ./build
cd ${REPODIR}/python/strings_udf
python setup.py build_ext --inplace -- -DCMAKE_PREFIX_PATH=${INSTALL_PREFIX} -DCMAKE_LIBRARY_PATH=${LIBCUDF_BUILD_DIR} ${EXTRA_CMAKE_ARGS} -- -j${PARALLEL_LEVEL:-1}
if [[ ${INSTALL_TARGET} != "" ]]; then
python setup.py install --single-version-externally-managed --record=record.txt -- -DCMAKE_PREFIX_PATH=${INSTALL_PREFIX} -DCMAKE_LIBRARY_PATH=${LIBCUDF_BUILD_DIR} ${EXTRA_CMAKE_ARGS} -- -j${PARALLEL_LEVEL:-1}
fi
fi

# Build and install the dask_cudf Python package
if buildAll || hasArg dask_cudf; then

13 changes: 13 additions & 0 deletions ci/cpu/build.sh
@@ -80,6 +80,15 @@ fi
if [ "$BUILD_LIBCUDF" == '1' ]; then
gpuci_logger "Build conda pkg for libcudf"
gpuci_conda_retry mambabuild --no-build-id --croot ${CONDA_BLD_DIR} conda/recipes/libcudf $CONDA_BUILD_ARGS

# BUILD_LIBCUDF == 1 means this is one of the cpu_build jobs,
# which is where the strings_udf package must also be built
gpuci_logger "Build conda pkg for strings_udf (python 3.8)"
gpuci_conda_retry mambabuild --no-build-id --croot ${CONDA_BLD_DIR} conda/recipes/strings_udf $CONDA_BUILD_ARGS --python=3.8
gpuci_logger "Build conda pkg for strings_udf (python 3.9)"
gpuci_conda_retry mambabuild --no-build-id --croot ${CONDA_BLD_DIR} conda/recipes/strings_udf $CONDA_BUILD_ARGS --python=3.9


mkdir -p ${CONDA_BLD_DIR}/libcudf/work
cp -r ${CONDA_BLD_DIR}/work/* ${CONDA_BLD_DIR}/libcudf/work
gpuci_logger "sccache stats"
@@ -108,6 +117,10 @@ if [ "$BUILD_CUDF" == '1' ]; then

gpuci_logger "Build conda pkg for custreamz"
gpuci_conda_retry mambabuild --croot ${CONDA_BLD_DIR} conda/recipes/custreamz --python=$PYTHON $CONDA_BUILD_ARGS $CONDA_CHANNEL

gpuci_logger "Build conda pkg for strings_udf"
gpuci_conda_retry mambabuild --croot ${CONDA_BLD_DIR} conda/recipes/strings_udf --python=$PYTHON $CONDA_BUILD_ARGS $CONDA_CHANNEL

fi
################################################################################
# UPLOAD - Conda packages
17 changes: 13 additions & 4 deletions ci/gpu/build.sh
@@ -121,11 +121,11 @@ if [[ -z "$PROJECT_FLASH" || "$PROJECT_FLASH" == "0" ]]; then
install_dask

################################################################################
# BUILD - Build libcudf, cuDF, libcudf_kafka, and dask_cudf from source
# BUILD - Build libcudf, cuDF, libcudf_kafka, dask_cudf, and strings_udf from source
################################################################################

gpuci_logger "Build from source"
"$WORKSPACE/build.sh" clean libcudf cudf dask_cudf libcudf_kafka cudf_kafka benchmarks tests --ptds
"$WORKSPACE/build.sh" clean libcudf cudf dask_cudf libcudf_kafka cudf_kafka strings_udf benchmarks tests --ptds

################################################################################
# TEST - Run GoogleTest
@@ -176,8 +176,12 @@ else
gpuci_conda_retry mambabuild --croot ${CONDA_BLD_DIR} conda/recipes/cudf_kafka --python=$PYTHON -c ${CONDA_ARTIFACT_PATH}
gpuci_conda_retry mambabuild --croot ${CONDA_BLD_DIR} conda/recipes/custreamz --python=$PYTHON -c ${CONDA_ARTIFACT_PATH}

gpuci_logger "Installing cudf, dask-cudf, cudf_kafka and custreamz"
gpuci_mamba_retry install cudf dask-cudf cudf_kafka custreamz -c "${CONDA_BLD_DIR}" -c "${CONDA_ARTIFACT_PATH}"
# The CUDA component of strings_udf must be built with CUDA 11.5, just like libcudf.
# Because there is no separate Python package, the Python code is also built on the 11.5 jobs.
# By this point (on the GPU test jobs), the whole package has already been built and
# copied by CI from the upstream 11.5 jobs into $CONDA_ARTIFACT_PATH.
gpuci_logger "Installing cudf, dask-cudf, cudf_kafka, custreamz, and strings_udf"
gpuci_mamba_retry install cudf dask-cudf cudf_kafka custreamz strings_udf -c "${CONDA_BLD_DIR}" -c "${CONDA_ARTIFACT_PATH}"

gpuci_logger "GoogleTests"
# Run libcudf and libcudf_kafka gtests from libcudf-tests package
@@ -243,6 +247,11 @@ cd "$WORKSPACE/python/custreamz"
gpuci_logger "Python py.test for cuStreamz"
py.test -n 8 --cache-clear --basetemp="$WORKSPACE/custreamz-cuda-tmp" --junitxml="$WORKSPACE/junit-custreamz.xml" -v --cov-config=.coveragerc --cov=custreamz --cov-report=xml:"$WORKSPACE/python/custreamz/custreamz-coverage.xml" --cov-report term custreamz

cd "$WORKSPACE/python/strings_udf"
gpuci_logger "Python py.test for strings_udf"
py.test -n 8 --cache-clear --basetemp="$WORKSPACE/strings-udf-cuda-tmp" --junitxml="$WORKSPACE/junit-strings-udf.xml" -v --cov-config=.coveragerc --cov=strings_udf --cov-report=xml:"$WORKSPACE/python/strings_udf/strings-udf-coverage.xml" --cov-report term strings_udf


# Run benchmarks with both cudf and pandas to ensure compatibility is maintained.
# Benchmarks are run in DEBUG_ONLY mode, meaning that only small data sizes are used.
# Therefore, these runs only verify that benchmarks are valid.
4 changes: 4 additions & 0 deletions conda/recipes/strings_udf/build.sh
@@ -0,0 +1,4 @@
# Copyright (c) 2022, NVIDIA CORPORATION.

# This assumes the script is executed from the root of the repo directory
./build.sh strings_udf
14 changes: 14 additions & 0 deletions conda/recipes/strings_udf/conda_build_config.yaml
@@ -0,0 +1,14 @@
c_compiler_version:
- 9

cxx_compiler_version:
- 9

sysroot_version:
- "2.17"

cmake_version:
- ">=3.20.1,!=3.23.0"

cuda_compiler:
- nvcc
79 changes: 79 additions & 0 deletions conda/recipes/strings_udf/meta.yaml
@@ -0,0 +1,79 @@
# Copyright (c) 2022, NVIDIA CORPORATION.

{% set version = environ.get('GIT_DESCRIBE_TAG', '0.0.0.dev').lstrip('v') + environ.get('VERSION_SUFFIX', '') %}
{% set minor_version = version.split('.')[0] + '.' + version.split('.')[1] %}
{% set py_version=environ.get('CONDA_PY', 38) %}
{% set cuda_version='.'.join(environ.get('CUDA', '11.5').split('.')[:2]) %}
{% set cuda_major=cuda_version.split('.')[0] %}

package:
name: strings_udf
version: {{ version }}

source:
git_url: ../../..

build:
number: {{ GIT_DESCRIBE_NUMBER }}
string: cuda_{{ cuda_major }}_py{{ py_version }}_{{ GIT_DESCRIBE_HASH }}_{{ GIT_DESCRIBE_NUMBER }}
script_env:
- VERSION_SUFFIX
- PARALLEL_LEVEL
# libcudf's run_exports pinning is looser than we would like
ignore_run_exports:
- libcudf
ignore_run_exports_from:
- {{ compiler('cuda') }}

requirements:
build:
- cmake {{ cmake_version }}
- {{ compiler('c') }}
- {{ compiler('cxx') }}
- {{ compiler('cuda') }} {{ cuda_version }}
- sysroot_{{ target_platform }} {{ sysroot_version }}
host:
- protobuf>=3.20.1,<3.21.0a0
- python
- cython >=0.29,<0.30
- scikit-build>=0.13.1
- setuptools
- numba >=0.54
- dlpack>=0.5,<0.6.0a0
- pyarrow =9
- libcudf ={{ version }}
- cudf ={{ version }}
- rmm ={{ minor_version }}
- cudatoolkit ={{ cuda_version }}
run:
- protobuf>=3.20.1,<3.21.0a0
- python
- typing_extensions
- pandas >=1.0,<1.5.0dev0
- cupy >=9.5.0,<11.0.0a0
- numba >=0.54
- numpy
- {{ pin_compatible('pyarrow', max_pin='x.x.x') }}
- libcudf {{ version }}
- cudf ={{ version }}
- fastavro >=0.22.0
- {{ pin_compatible('rmm', max_pin='x.x') }}
- fsspec>=0.6.0
- {{ pin_compatible('cudatoolkit', max_pin='x', min_pin='x') }}
- nvtx >=0.2.1
- packaging
- cachetools
- ptxcompiler # [linux64] # CUDA enhanced compatibility. See https://github.com/rapidsai/ptxcompiler
- cuda-python >=11.5,<11.7.1
test: # [linux64]
requires: # [linux64]
- cudatoolkit {{ cuda_version }}.* # [linux64]
imports: # [linux64]
- strings_udf # [linux64]

about:
home: https://rapids.ai/
license: Apache-2.0
license_family: APACHE
license_file: LICENSE
summary: strings_udf library
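The Jinja `set` statements at the top of this recipe are ordinary string manipulation, so their results can be reproduced in plain Python. The sketch below mirrors those expressions; `recipe_versions`, `build_string`, and the sample environment values (tag, CUDA version, hash, build number) are hypothetical and chosen only to show what the recipe would compute.

```python
def recipe_versions(environ):
    """Recompute the values derived by the recipe's Jinja `set` lines."""
    # GIT_DESCRIBE_TAG with leading "v" characters stripped, plus an optional suffix.
    version = environ.get("GIT_DESCRIBE_TAG", "0.0.0.dev").lstrip("v") + environ.get(
        "VERSION_SUFFIX", ""
    )
    # First two dot-separated components, e.g. "22.10".
    minor_version = version.split(".")[0] + "." + version.split(".")[1]
    py_version = environ.get("CONDA_PY", 38)
    # Keep only major.minor of the CUDA version, e.g. "11.5.2" becomes "11.5".
    cuda_version = ".".join(environ.get("CUDA", "11.5").split(".")[:2])
    cuda_major = cuda_version.split(".")[0]
    return {
        "version": version,
        "minor_version": minor_version,
        "py_version": py_version,
        "cuda_version": cuda_version,
        "cuda_major": cuda_major,
    }


def build_string(values, git_hash, git_number):
    """Format the conda build string the same way as the `build.string` field."""
    return f"cuda_{values['cuda_major']}_py{values['py_version']}_{git_hash}_{git_number}"


# Hypothetical environment, for illustration only.
example = recipe_versions(
    {"GIT_DESCRIBE_TAG": "v22.10.00", "CUDA": "11.5.2", "CONDA_PY": "39"}
)
```

Note that the build string embeds only the CUDA major version, while the `host` requirement on `cudatoolkit` pins the full major.minor value.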
30 changes: 12 additions & 18 deletions python/cudf/cudf/core/indexed_frame.py
@@ -51,7 +51,12 @@
from cudf.core.missing import NA
from cudf.core.multiindex import MultiIndex
from cudf.core.resample import _Resampler
from cudf.core.udf.utils import _compile_or_get, _supported_cols_from_frame
from cudf.core.udf.utils import (
_compile_or_get,
_get_input_args_from_frame,
_post_process_output_col,
_return_col_from_dtype,
)
from cudf.utils import docutils
from cudf.utils.utils import _cudf_nvtx_annotate

@@ -1799,30 +1804,19 @@ def _apply(self, func, kernel_getter, *args, **kwargs):
) from e

# Mask and data column preallocated
ans_col = cp.empty(len(self), dtype=retty)
ans_col = _return_col_from_dtype(retty, len(self))
ans_mask = cudf.core.column.column_empty(len(self), dtype="bool")
launch_args = [(ans_col, ans_mask), len(self)]
offsets = []

# if _compile_or_get succeeds, it is safe to create a kernel that only
# consumes the columns that are of supported dtype
for col in _supported_cols_from_frame(self).values():
data = col.data
mask = col.mask
if mask is None:
launch_args.append(data)
else:
launch_args.append((data, mask))
offsets.append(col.offset)
launch_args += offsets
launch_args += list(args)
output_args = [(ans_col, ans_mask), len(self)]
input_args = _get_input_args_from_frame(self)
launch_args = output_args + input_args + list(args)

try:
kernel.forall(len(self))(*launch_args)
except Exception as e:
raise RuntimeError("UDF kernel execution failed.") from e

col = cudf.core.column.as_column(ans_col)
col = _post_process_output_col(ans_col, retty)

col.set_base_mask(libcudf.transform.bools_to_mask(ans_mask))
result = cudf.Series._from_data({None: col}, self._index)
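The hunk above replaces an inline loop with `_get_input_args_from_frame`, whose body is not shown in this excerpt. As a rough sketch of the resulting argument layout, the code below reimplements the removed loop in plain Python. `FakeColumn`, `get_input_args`, and `build_launch_args` are hypothetical stand-ins, and the real helper may differ, for example in how string columns are converted.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class FakeColumn:
    """Hypothetical stand-in for a cuDF column: data buffer, optional mask, offset."""

    data: Any
    mask: Optional[Any]
    offset: int


def get_input_args(columns: List[FakeColumn]) -> list:
    """Mirror the removed loop: one entry per column, then all offsets."""
    args, offsets = [], []
    for col in columns:
        if col.mask is None:
            # Columns without a null mask are passed as bare data.
            args.append(col.data)
        else:
            # Nullable columns are passed as a (data, mask) pair.
            args.append((col.data, col.mask))
        offsets.append(col.offset)
    return args + offsets


def build_launch_args(ans_col, ans_mask, size, columns, extra_args=()):
    """Assemble outputs first, then inputs, then any user-supplied extra args."""
    output_args = [(ans_col, ans_mask), size]
    return output_args + get_input_args(columns) + list(extra_args)


cols = [FakeColumn("d0", None, 0), FakeColumn("d1", "m1", 2)]
launch_args = build_launch_args("out", "out_mask", 3, cols, extra_args=(7,))
```

The ordering matters because the compiled kernel reads its arguments positionally: the output pair and size come first, followed by the per-column inputs, the offsets, and finally the extra UDF arguments.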

62 changes: 61 additions & 1 deletion python/cudf/cudf/core/udf/__init__.py
@@ -1 +1,61 @@
from . import typing, lowering
# Copyright (c) 2022, NVIDIA CORPORATION.
from . import masked_typing, masked_lowering
from numba import cuda
from numba import types
from numba.cuda.cudaimpl import (
lower as cuda_lower,
registry as cuda_lowering_registry,
)
from cudf.core.udf import api
from cudf.core.udf import utils
from cudf.core.udf import row_function
from cudf.core.dtypes import dtype
import numpy as np

units = ["ns", "ms", "us", "s"]
datetime_cases = {types.NPDatetime(u) for u in units}
timedelta_cases = {types.NPTimedelta(u) for u in units}


supported_masked_types = (
types.integer_domain
| types.real_domain
| datetime_cases
| timedelta_cases
| {types.boolean}
)

_STRING_UDFS_ENABLED = False
try:
import strings_udf

if strings_udf.ENABLED:
from . import strings_typing
from . import strings_lowering
from strings_udf import ptxpath
from strings_udf._typing import string_view, str_view_arg_handler
from strings_udf._lib.cudf_jit_udf import to_string_view_array

# add an overload of MaskedType.__init__(string_view, bool)
cuda_lower(api.Masked, strings_typing.string_view, types.boolean)(
masked_lowering.masked_constructor
)

# add an overload of pack_return(string_view)
cuda_lower(api.pack_return, strings_typing.string_view)(
masked_lowering.pack_return_scalar_impl
)

supported_masked_types |= {strings_typing.string_view}
utils.launch_arg_getters[dtype("O")] = to_string_view_array
utils.masked_array_types[dtype("O")] = string_view
utils.files.append(ptxpath)
utils.arg_handlers.append(str_view_arg_handler)
row_function.itemsizes[dtype("O")] = string_view.size_bytes

_STRING_UDFS_ENABLED = True
except ImportError:
# allow cuDF to work without strings_udf
pass

masked_typing.register_masked_constructor(supported_masked_types)
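The module above treats `strings_udf` as an optional dependency: it attempts the import, checks an `ENABLED` flag, and only then extends the registries of supported types and argument getters. A minimal, library-free sketch of that pattern follows. The registry names, `enable_optional_backend`, and the string keys are hypothetical simplifications, not the cuDF API.

```python
import importlib

# Hypothetical registries standing in for supported_masked_types and
# utils.launch_arg_getters in the module above.
SUPPORTED_TYPES = {"int64", "float64", "bool"}
LAUNCH_ARG_GETTERS = {}


def enable_optional_backend(module_name: str) -> bool:
    """Register string support only if the optional backend imports and is enabled."""
    try:
        backend = importlib.import_module(module_name)
    except ImportError:
        # The optional package is not installed: keep working without it.
        return False
    if not getattr(backend, "ENABLED", False):
        # The package is installed but reports that it cannot be used
        # (for example, on a machine without a usable GPU).
        return False
    SUPPORTED_TYPES.add("string_view")
    LAUNCH_ARG_GETTERS["O"] = getattr(backend, "to_string_view_array", None)
    return True
```

Extending the registries before the final registration call means the core code path does not need to know whether string support is present; it simply iterates over whatever types were registered.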
Original file line number Diff line number Diff line change
@@ -18,7 +18,7 @@
comparison_ops,
unary_ops,
)
from cudf.core.udf.typing import MaskedType, NAType
from cudf.core.udf.masked_typing import MaskedType, NAType


@cuda_lowering_registry.lower_constant(NAType)
@@ -62,7 +62,6 @@ def masked_scalar_op_impl(context, builder, sig, args):
result = cgutils.create_struct_proxy(masked_return_type)(
context, builder
)

# compute output validity
valid = builder.and_(m1.valid, m2.valid)
result.valid = valid