Move strings_udf code into cuDF #12669

Merged

brandon-b-miller
Contributor

With the merge of #11452 we have the machinery to build and deploy PTX libraries of shim functions as part of cuDF's build process. With this there is no reason to keep the strings_udf code separate anymore. This PR removes the separate package and all of its related CI plumbing, and supports the strings feature by default, just like GroupBy.

@brandon-b-miller added the labels: feature request (New feature or request), Python (Affects Python cuDF API.), non-breaking (Non-breaking change). Feb 1, 2023
@github-actions bot added the labels: ci, CMake (CMake build issue). Feb 1, 2023
Contributor

@bdice bdice left a comment

This is so nice! I have a few minor comments, some of which I think were deferred from the original strings_udf PR.

I assumed that most of the code changes in this PR were pure moves, and I didn't look too closely at the things that seemed familiar.

python/cudf/cudf/core/udf/__init__.py (resolved)
Comment on lines 50 to 51
heap_size = 0
cudf_str_dtype = dtype(str)
Contributor

Should these be public? The second one looks like it should be a module constant.

Suggested change
- heap_size = 0
- cudf_str_dtype = dtype(str)
+ _heap_size = 0
+ _CUDF_STR_DTYPE = dtype(str)

Contributor

Agreed, more generally this module has a lot of module-level vars that should probably be internal and prefixed with underscores.

from cudf.core.udf._ops import comparison_ops
from cudf.core.udf.masked_typing import MaskedType
# libcudf size_type
size_type = types.int32
Contributor

Let's move this definition next to where we hardcode other libcudf information like size_type_dtype (or reuse that declaration, if possible).

size_type_dtype = np.dtype("int32")

@codecov

codecov bot commented Feb 1, 2023

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.04@a308b24).
Patch has no changes to coverable lines.

❗ Current head 3eeff2f differs from pull request most recent head e84e243. Consider uploading reports for the commit e84e243 to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-23.04   #12669   +/-   ##
===============================================
  Coverage                ?   85.80%           
===============================================
  Files                   ?      154           
  Lines                   ?    25128           
  Branches                ?        0           
===============================================
  Hits                    ?    21561           
  Misses                  ?     3567           
  Partials                ?        0           

@bdice
Contributor

bdice commented Feb 1, 2023

@brandon-b-miller Can we rename this PR? “Sunset” sounds like a deprecation to me, but we’re doing a hard break because the feature was experimental. I’d propose “Move strings_udf into cudf package” or similar.

I also labeled this as breaking because it’s changing what packages are built and deployed.

@bdice added the label breaking (Breaking change) and removed the label non-breaking (Non-breaking change). Feb 1, 2023
@bdice
Contributor

bdice commented Feb 1, 2023

@brandon-b-miller Also, can we file (or update) a companion issue that tracks any needed changes to resources outside this repo, like blogs, RAPIDS website docs, etc., that tell users to install strings_udf?

@brandon-b-miller changed the title from "Sunset separate strings_udf package" to "Move strings_udf code into cuDF". Feb 7, 2023
Contributor

@vyasr vyasr left a comment

Looks like there are still some references to strings_udf in this repo based on a git grep. Let me know if you need help tracking everything down.

Excited to see this happening! Love to see a -3500 LoC PR.

python/cudf/cudf/_lib/CMakeLists.txt (outdated, resolved)
python/cudf/cudf/core/udf/utils.py (outdated, resolved)
python/cudf/cudf/core/udf/masked_lowering.py (resolved)
python/cudf/cudf/core/udf/masked_typing.py (outdated, resolved)


# Strings functions and utilities
def _is_valid_string_arg(ty):
Contributor

I'm a bit confused. It looks like the contents of strings_udf/_typing were moved into core/udf/strings_typing, and then the old contents of strings_typing were moved into this file. Could you comment on the rationale for the current separation? Is there a distinction between strings and nullable strings logic being made?

Contributor Author

@brandon-b-miller brandon-b-miller Feb 9, 2023

Yes! This is a great question. Our MaskedType extension is the type we use to carry around a value and a validity. When using DataFrame.apply or Series.apply, the scalars inside your UDF are really MaskedType instances:

import cudf

df = cudf.DataFrame(
    {
        'a': [1, 2, 3],
        'b': ['a', 'b', 'c'],  # one string column 'b'
    }
)

def f(row):
    x = row['a']   # MaskedType(int64)
    y = row['b']   # MaskedType(string_view)
    z = y.upper()  # MaskedType(udf_string); unused, just for demo
    return x + len(y)

I always thought the cleanest way of putting this together was to have the string_view and udf_string types actually implement the string methods, like upper or len, whereas the MaskedType that carries it around holds the nullability logic. At that point calling something like len(MaskedType(string_view)) is programmed in the lowering to translate roughly to:

len(MaskedType(string_view, valid=True)) = MaskedType(len(string_view), valid=True)

meaning it ends up taking the validity of the source. That would make our strings just another type of scalar that could be masked.
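
As a rough illustration of the rule above (not the actual numba typing/lowering code in cuDF), a plain-Python stand-in might look like the following. The `Masked` class and `masked_len` helper are hypothetical names invented for this sketch; only the value-plus-validity idea comes from the comment.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Masked:
    """Hypothetical stand-in for MaskedType: a value plus a validity flag."""
    value: Any
    valid: bool = True

    def __add__(self, other: "Masked") -> "Masked":
        # The result is valid only if both operands are valid.
        return Masked(self.value + other.value, self.valid and other.valid)


def masked_len(m: Masked) -> Masked:
    # The string type supplies len(); the wrapper only forwards validity.
    return Masked(len(m.value), m.valid)


def f(row: dict) -> Masked:
    # Mirrors the UDF above: x + len(y)
    return row["a"] + masked_len(row["b"])


result = f({"a": Masked(1), "b": Masked("abc")})
null_result = f({"a": Masked(1), "b": Masked("abc", valid=False)})
```

Here the string operation itself knows nothing about nulls; the wrapper is the only place where validity is handled.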

The requirement of the external package made all of this much harder, though, since much of it had to be optional. Before this PR we had this happening:

  • string scalars and their ops defined in strings_udf
  • Masked string operations defined in cudf (in strings_typing.py and strings_lowering.py, deliberately in a separate place so that they could be optionally imported/registered)
  • cudf optionally importing strings_typing and strings_lowering which contained all the overloads for masked strings

Now what we have is:

  • string scalars and their ops defined in strings_typing.py and strings_lowering.py
  • Masked string operations defined in masked_typing.py and masked_lowering.py
  • Nothing is optional

I think the second way is a lot cleaner.
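
To make the contrast concrete, an "optional" arrangement typically relies on a guarded import along the lines below. This is a generic sketch, not the code that was removed: `register_optional` and the module names passed to it are hypothetical.

```python
import importlib


def register_optional(module_name: str, registry: list) -> bool:
    """Import a module only if it is installed; record it in `registry`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Optional dependency missing: the feature is simply unavailable.
        return False
    registry.append(module)
    return True


registry = []
have_json = register_optional("json", registry)  # stdlib, always present
have_missing = register_optional("hypothetical_missing_pkg_xyz", registry)
```

Once everything lives in one package, the guard collapses to a plain import and every code path can assume the feature exists.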

Contributor

Idle thought, not for implementation in this PR: is there a way we can take advantage of what looks to be the functorial nature of the MaskedType object to simplify the code and reuse the logic from strings_typing/lowering more systematically?
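
One way to read the "functorial" idea, purely as a sketch and not something proposed in this PR: write each string operation once on the plain type and lift it generically to the masked wrapper. `Masked` and `lift` are hypothetical names; the real code works at the numba typing/lowering level, where such a helper would look quite different.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Masked:
    """Hypothetical value-plus-validity wrapper."""
    value: Any
    valid: bool = True


def lift(op: Callable[..., Any]) -> Callable[..., Masked]:
    """Turn an op on plain values into an op on Masked values."""
    def lifted(*args: Masked) -> Masked:
        # The result is valid only when every input is valid.
        valid = all(a.valid for a in args)
        return Masked(op(*(a.value for a in args)), valid)
    return lifted


masked_upper = lift(str.upper)
masked_len = lift(len)
masked_startswith = lift(str.startswith)

upper_result = masked_upper(Masked("abc"))
len_result = masked_len(Masked("abc", valid=False))
starts_result = masked_startswith(Masked("abc"), Masked("ab"))
```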

Contributor

Thanks for the great explanation! That makes sense, and the new way is certainly simpler without the old extra package.

@ttnghia
Contributor

ttnghia commented Feb 9, 2023

This is quite a large PR. Can we split it into smaller PRs somehow?

@bdice
Contributor

bdice commented Feb 9, 2023

> This is quite a large PR. Can we split it into smaller PRs somehow?

Most of this PR is a pure move, but the pieces are pretty intertwined and it would be hard to break up. The code has been previously reviewed by several folks, so it's all familiar and shouldn't be hard to review.

@brandon-b-miller brandon-b-miller marked this pull request as ready for review February 10, 2023 14:31
Contributor

@bdice bdice left a comment

Packaging questions. Changes are probably needed.

ci/test_python_other.sh (resolved)
- cudf ={{ version }}
- {{ pin_compatible('cudatoolkit', max_pin='x', min_pin='x') }}
- cachetools
- ptxcompiler >=0.7.0 # CUDA enhanced compatibility. See https://github.com/rapidsai/ptxcompiler
Contributor

We don't require ptxcompiler >=0.7.0 in the cudf recipe. We should, right?

Contributor Author

Added this. Currently this is being inherited through cubinlinker.



strings_ptx_file = _get_ptx_file(os.path.dirname(__file__), "shim_")
ptx_files.append(strings_ptx_file)
Contributor

Why don't we just initialize ptx_files with this value above? We don't need to initialize-and-modify (append).
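
The suggestion amounts to building the list in one expression. A minimal sketch, with `find_ptx_file` as a hypothetical stand-in for `_get_ptx_file` (whose real behavior is not shown in this thread):

```python
import os


def find_ptx_file(directory: str, prefix: str) -> str:
    # Hypothetical stand-in: just build a path; the real helper does more.
    return os.path.join(directory, prefix + "example.ptx")


# Initialize-and-modify, as in the snippet under review:
ptx_files_appended = []
ptx_files_appended.append(find_ptx_file("udf", "shim_"))

# Initialize directly with the value:
ptx_files = [find_ptx_file("udf", "shim_")]
```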

Contributor Author

I refactored this a bit in c304f5c.

Contributor

@bdice bdice left a comment

Looks good to me! Thanks for your iterations, @brandon-b-miller.

Contributor

@vyasr vyasr left a comment

One small change then I think this is good to go.

python/cudf/udf_cpp/CMakeLists.txt (outdated, resolved)
Member

@ajschmidt8 ajschmidt8 left a comment

Approving ops-codeowner file changes

@brandon-b-miller
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit f90ae52 into rapidsai:branch-23.04 Feb 22, 2023
Labels
breaking Breaking change CMake CMake build issue feature request New feature or request Python Affects Python cuDF API.