Move `strings_udf` code into cuDF #12669

brandon-b-miller · 2023-02-01T17:45:41Z

With the merge of #11452, we have the machinery to build and deploy PTX libraries of shim functions as part of cuDF's build process. With this, there is no reason to keep the strings_udf code separate anymore. This PR removes the separate package and all of its related CI plumbing, and supports the strings feature by default, just like GroupBy.

bdice

This is so nice! I have a few minor comments, some of which I think were deferred from the original strings_udf PR.

I assumed that most of the code changes in this PR were pure moves, and I didn't look too closely at the things that seemed familiar.

python/cudf/cudf/core/udf/__init__.py

bdice · 2023-02-01T18:11:44Z

python/cudf/cudf/core/udf/utils.py

+heap_size = 0
+cudf_str_dtype = dtype(str)


Should these be public? The second one looks like it should be a module constant.

Suggested change

-heap_size = 0
-cudf_str_dtype = dtype(str)
+_heap_size = 0
+_CUDF_STR_DTYPE = dtype(str)

Agreed, more generally this module has a lot of module-level vars that should probably be internal and prefixed with underscores.

bdice · 2023-02-01T18:14:52Z

python/cudf/cudf/core/udf/strings_typing.py

-from cudf.core.udf._ops import comparison_ops
-from cudf.core.udf.masked_typing import MaskedType
+# libcudf size_type
+size_type = types.int32


Let's move this definition next to where we hardcode other libcudf information like size_type_dtype (or reuse that declaration, if possible).

cudf/python/cudf/cudf/_lib/types.pyx

Line 20 in 6a67e8f

size_type_dtype = np.dtype("int32")

codecov · 2023-02-01T18:53:24Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.04@a308b24).
Patch has no changes to coverable lines.

❗ Current head 3eeff2f differs from pull request most recent head e84e243. Consider uploading reports for the commit e84e243 to get more accurate results

Additional details and impacted files

@@               Coverage Diff               @@
##             branch-23.04   #12669   +/-   ##
===============================================
  Coverage                ?   85.80%           
===============================================
  Files                   ?      154           
  Lines                   ?    25128           
  Branches                ?        0           
===============================================
  Hits                    ?    21561           
  Misses                  ?     3567           
  Partials                ?        0

bdice · 2023-02-01T19:27:51Z

@brandon-b-miller Can we rename this PR? “Sunset” sounds like a deprecation to me, but we’re doing a hard break because the feature was experimental. I’d propose “Move strings_udf into cudf package” or similar.

I also labeled this as breaking because it’s changing what packages are built and deployed.

bdice · 2023-02-01T19:30:59Z

@brandon-b-miller Also can we file (or update) a companion issue that tracks any needed changes to resources outside this repo like blogs, RAPIDS website docs, etc. that tell users to install strings_udf?

vyasr

Looks like there are still some references to strings_udf in this repo based on a git grep. Let me know if you need help tracking everything down.

Excited to see this happening! Love to see a -3500 LoC PR.

python/cudf/cudf/_lib/CMakeLists.txt

python/cudf/cudf/core/udf/utils.py

vyasr · 2023-02-08T23:51:40Z

python/cudf/cudf/core/udf/utils.py

+heap_size = 0
+cudf_str_dtype = dtype(str)


Agreed, more generally this module has a lot of module-level vars that should probably be internal and prefixed with underscores.

python/cudf/cudf/core/udf/masked_lowering.py

python/cudf/cudf/core/udf/masked_typing.py

vyasr · 2023-02-09T00:02:01Z

python/cudf/cudf/core/udf/masked_typing.py

+
+
+# Strings functions and utilities
+def _is_valid_string_arg(ty):


I'm a bit confused. It looks like the contents of strings_udf/_typing was moved into core/udf/strings_typing and then the old contents of strings_typing were moved into this file. Could you comment on the rationale for the current separation? Is there a distinction between strings and nullable strings logic being made?

Yes! This is a great question. Our MaskedType extension is the type we use to carry around a value and a validity. When using DataFrame.apply or Series.apply, the scalars inside your UDF are really MaskedType instances:

import cudf

df = cudf.DataFrame(
    {
        'a': [1, 2, 3],
        'b': ['a', 'b', 'c'],  # one string column 'b'
    }
)

def f(row):
    x = row['a']   # MaskedType(int64)
    y = row['b']   # MaskedType(string_view)
    z = y.upper()  # MaskedType(udf_string), not used, just for demo
    return x + len(y)

I always thought the cleanest way of putting this together was to have the string_view and udf_string types actually implement the string methods, like upper or len, whereas the MaskedType that carries it around holds the nullability logic. At that point calling something like len(MaskedType(string_view)) is programmed in the lowering to translate roughly to:

len(MaskedType(string_view, valid=True)) = MaskedType(len(string_view), valid=True)

meaning it ends up taking the validity of the source. That would make our strings just another type of scalar that could be masked.
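The validity-propagation rule described above can be sketched in plain Python. This is only an illustration of the idea, not cuDF's Numba typing or lowering code: `Masked` and `lift` are hypothetical names, and skipping the call for null inputs is an assumption made to keep the example simple.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Masked:
    """Hypothetical stand-in for MaskedType: a value plus a validity flag."""
    value: Any
    valid: bool = True


def lift(fn: Callable[[Any], Any]) -> Callable[[Masked], Masked]:
    """Turn a function on plain values into one on Masked values.

    The result carries the validity of its input, mirroring
    len(MaskedType(string_view, valid)) -> MaskedType(len(string_view), valid).
    """
    def lifted(arg: Masked) -> Masked:
        # Assumption: for a null input we skip fn and leave the payload as None.
        result = fn(arg.value) if arg.valid else None
        return Masked(result, arg.valid)
    return lifted


# The string operations know nothing about nullability; lift adds it.
masked_len = lift(len)
masked_upper = lift(str.upper)

print(masked_len(Masked("abc")))              # Masked(value=3, valid=True)
print(masked_upper(Masked("abc")))            # Masked(value='ABC', valid=True)
print(masked_len(Masked(None, valid=False)))  # Masked(value=None, valid=False)
```

Here the string methods live on the plain value and the wrapper only handles nullability, which is the separation of concerns this comment argues for.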

The requirement of the external package made a lot of this way harder though since we needed a lot of this to be optional. Before this PR we had this happening:

string scalars and their ops defined in strings_udf

Masked string operations defined in cudf (in strings_typing.py and strings_lowering.py, deliberately in a separate place so that they could be optionally imported/registered)

cudf optionally importing strings_typing and strings_lowering which contained all the overloads for masked strings

Now what we have is:

string scalars and their ops defined in strings_typing.py and strings_lowering.py

Masked string operations defined in masked_typing.py and masked_lowering.py

Nothing is optional

I think the second way is a lot cleaner.
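The difference between the two layouts can be sketched roughly as follows. This is a simplified illustration, not the actual cuDF import logic: the function names and feature labels are hypothetical, and the only real API used is `importlib.util.find_spec`.

```python
import importlib.util
from typing import List


def enabled_string_features(optional_package: str = "strings_udf") -> List[str]:
    """Old layout (sketch): string support only if the extra package is importable."""
    if importlib.util.find_spec(optional_package) is None:
        # Package missing: none of the string overloads get registered.
        return []
    return ["string scalar ops", "masked string ops"]


def enabled_string_features_merged() -> List[str]:
    """New layout (sketch): everything ships with cudf, so nothing is conditional."""
    return ["string scalar ops", "masked string ops"]


print(enabled_string_features("a_package_that_should_not_exist_12345"))
print(enabled_string_features_merged())
```

Removing the conditional branch is what lets the masked string overloads sit next to the other masked operations instead of in a separately importable module.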

Idle thought, not for implementation in this PR: is there a way we can take advantage of what looks to be the functorial nature of the MaskedType object to simplify the code and reuse that from strings_typing/lowering more systematically?

Thanks for the great explanation! That makes sense, and the new way is certainly simpler without the old extra package.

ttnghia · 2023-02-09T21:59:14Z

This is quite a large PR. Can we split it into smaller PRs somehow?

bdice · 2023-02-09T22:02:34Z

This is quite a large PR. Can we split it into smaller PRs somehow?

Most of this PR is a pure move, but the pieces are pretty intertwined and it would be hard to break up. The code has been previously reviewed by several folks, so it's all familiar and shouldn't be hard to review.

bdice

Packaging questions. Changes are probably needed.

ci/test_python_other.sh

bdice · 2023-02-15T17:21:44Z

conda/recipes/strings_udf/meta.yaml

-    - cudf ={{ version }}
-    - {{ pin_compatible('cudatoolkit', max_pin='x', min_pin='x') }}
-    - cachetools
-    - ptxcompiler >=0.7.0  # CUDA enhanced compatibility. See https://github.com/rapidsai/ptxcompiler


We don't require ptxcompiler >=0.7.0 in the cudf recipe. We should, right?

Added this. Currently this is being inherited through cubinlinker.

bdice · 2023-02-15T17:24:42Z

python/cudf/cudf/core/udf/utils.py

+
+
+strings_ptx_file = _get_ptx_file(os.path.dirname(__file__), "shim_")
+ptx_files.append(strings_ptx_file)


Why don't we just initialize ptx_files with this value above? We don't need to initialize-and-modify (append).

I refactored this a bit in c304f5c.
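For illustration, the two styles discussed in this thread compare as follows. This is a minimal sketch: `_get_ptx_file` is replaced by a hypothetical stub (the real helper is not shown here), `UDF_DIR` is a made-up directory, and `ptx_files` is assumed to start out empty.

```python
import os

UDF_DIR = "udf_dir"  # hypothetical directory, for illustration only


def _get_ptx_file(path: str, prefix: str) -> str:
    """Hypothetical stub standing in for the real helper in utils.py."""
    return os.path.join(path, prefix + "example.ptx")


# Initialize-and-modify, as in the diff under review:
ptx_files = []
strings_ptx_file = _get_ptx_file(UDF_DIR, "shim_")
ptx_files.append(strings_ptx_file)

# Single-step initialization, as suggested in the review:
ptx_files_direct = [_get_ptx_file(UDF_DIR, "shim_")]

# Both approaches produce the same list.
assert ptx_files == ptx_files_direct
```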

bdice

Looks good to me! Thanks for your iterations, @brandon-b-miller.

vyasr

One small change then I think this is good to go.

python/cudf/udf_cpp/CMakeLists.txt

review-notebook-app · 2023-02-16T19:37:06Z

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

ajschmidt8

Approving ops-codeowner file changes

brandon-b-miller · 2023-02-22T14:14:28Z

/merge

brandon-b-miller added 8 commits January 31, 2023 10:31

move over lots of code and make cudf importable

452f45f

python cleanup first pass

bf2a7bc

remove strings_udf package, old test file moved

b2c6dbd

pass tests in test_string_udfs

4abb7c2

python cleanup second pass

cb78882

masked type imports the string not the other way around :)

b2921ea

python cleanup third pass

fb35e98

have fun deleting code

3eeff2f

brandon-b-miller added the feature request, Python, and non-breaking labels Feb 1, 2023

github-actions bot added the ci and CMake labels Feb 1, 2023

bdice reviewed Feb 1, 2023

bdice added the breaking label and removed the non-breaking label Feb 1, 2023

Merge branch 'branch-23.04' into sunset-stringsudf-package

5b21a5e

brandon-b-miller changed the title from "Sunset separate strings_udf package" to "Move strings_udf code into cuDF" Feb 7, 2023

brandon-b-miller added 2 commits February 7, 2023 13:09

python cleanup fourth pass

f37206c

remove unnecessary code

a32da85

vyasr mentioned this pull request Feb 8, 2023

Allow casting from UDFString back to StringView to call methods in strings_udf #12363

Merged

vyasr requested changes Feb 9, 2023

merge latest

77c5d1d

brandon-b-miller marked this pull request as ready for review February 10, 2023 14:31

brandon-b-miller added 4 commits February 14, 2023 07:18

only link cudf_strings_udf to strings_udf cython module

1963937

use initfunc

fbcf7c1

don't allow pyobject to map to a maskedtype

f424b0a

simplify logic

cd7bec0

bdice requested changes Feb 15, 2023

brandon-b-miller added 2 commits February 15, 2023 13:17

address reviews, only load one ptx file

c304f5c

reorder

4cbe437

bdice approved these changes Feb 15, 2023

vyasr requested changes Feb 16, 2023

python/cudf/udf_cpp/CMakeLists.txt

brandon-b-miller added 2 commits February 16, 2023 07:50

cmake updates

9b175e9

Merge branch 'branch-23.04' into sunset-stringsudf-package

24c0de0

vyasr approved these changes Feb 16, 2023

brandon-b-miller requested a review from ajschmidt8 February 16, 2023 16:17

fix notebooks

7ff60a0

Merge branch 'branch-23.04' into sunset-stringsudf-package

dc5f3f9

bdice mentioned this pull request Feb 17, 2023

Cache JIT GroupBy.apply functions #12802

Merged

debug commit

fc324d4

ajschmidt8 approved these changes Feb 17, 2023

brandon-b-miller and others added 9 commits February 17, 2023 10:49

second debug commit

9f522fc

more debugging

44a67aa

revert changes

aa75ec5

revert more changes

3a841de

look for libcudf.so one dir up

473ac7a

Merge remote-tracking branch 'origin/branch-23.04' into sunset-stringsudf-package

2945dc8

Remove redundant cuda arch init.

f8ccee6

nvrtc may not be needed

e8f58eb

Merge branch 'sunset-stringsudf-package' of github.com:brandon-b-miller/cudf into sunset-stringsudf-package

e84e243

rapids-bot bot merged commit f90ae52 into rapidsai:branch-23.04 Feb 22, 2023

brandon-b-miller commented Feb 1, 2023

bdice left a comment

bdice Feb 1, 2023

vyasr Feb 8, 2023

bdice Feb 1, 2023

codecov bot commented Feb 1, 2023

bdice commented Feb 1, 2023

bdice commented Feb 1, 2023

vyasr left a comment

vyasr Feb 8, 2023

vyasr Feb 9, 2023

brandon-b-miller Feb 9, 2023

wence- Feb 13, 2023

vyasr Feb 16, 2023

ttnghia commented Feb 9, 2023

bdice commented Feb 9, 2023

bdice left a comment

bdice Feb 15, 2023

brandon-b-miller Feb 15, 2023

bdice Feb 15, 2023

brandon-b-miller Feb 15, 2023

bdice left a comment

vyasr left a comment

review-notebook-app bot commented Feb 16, 2023

ajschmidt8 left a comment

brandon-b-miller commented Feb 22, 2023


