-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move strings_udf
code into cuDF
#12669
Move strings_udf
code into cuDF
#12669
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so nice! I have a few minor comments, some of which I think were deferred from the original strings_udf
PR.
I assumed that most of the code changes in this PR were pure moves, and I didn't look too closely at the things that seemed familiar.
python/cudf/cudf/core/udf/utils.py
Outdated
heap_size = 0 | ||
cudf_str_dtype = dtype(str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be public? The second one looks like it should be a module constant.
heap_size = 0 | |
cudf_str_dtype = dtype(str) | |
_heap_size = 0 | |
_CUDF_STR_DTYPE = dtype(str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, more generally this module has a lot of module-level vars that should probably be internal and prefixed with underscores.
from cudf.core.udf._ops import comparison_ops | ||
from cudf.core.udf.masked_typing import MaskedType | ||
# libcudf size_type | ||
size_type = types.int32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this definition next to where we hardcode other libcudf information like size_type_dtype
(or reuse that declaration, if possible).
cudf/python/cudf/cudf/_lib/types.pyx
Line 20 in 6a67e8f
size_type_dtype = np.dtype("int32") |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## branch-23.04 #12669 +/- ##
===============================================
Coverage ? 85.80%
===============================================
Files ? 154
Lines ? 25128
Branches ? 0
===============================================
Hits ? 21561
Misses ? 3567
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
@brandon-b-miller Can we rename this PR? “Sunset” sounds like a deprecation to me, but we’re doing a hard break because the feature was experimental. I’d propose “Move strings_udf into cudf package” or similar. I also labeled this as |
@brandon-b-miller Also can we file (or update) a companion issue that tracks any needed changes to resources outside this repo like blogs, RAPIDS website docs, etc. that tell users to install strings_udf? |
strings_udf
packagestrings_udf
code into cuDF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are still some references to strings_udf in this repo based on a git grep. Let me know if you need help tracking everything down.
Excited to see this happening! Love to see a -3500 LoC PR.
python/cudf/cudf/core/udf/utils.py
Outdated
heap_size = 0 | ||
cudf_str_dtype = dtype(str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, more generally this module has a lot of module-level vars that should probably be internal and prefixed with underscores.
|
||
|
||
# Strings functions and utilities | ||
def _is_valid_string_arg(ty): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused. It looks like the contents of strings_udf/_typing was moved into core/udf/strings_typing and then the old contents of strings_typing were moved into this file. Could you comment on the rationale for the current separation? Is there a distinction between strings and nullable strings logic being made?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! This is a great question. Our MaskedType
extension is the type we use to carry around a value and a validity. When using DataFrame.apply
or Series.apply
, the scalar inside your UDF are really MaskedType
:
df = cudf.DataFrame(
{
'a':[1, 2, 3],
'b':['a','b','c'] # one string column 'b'
})
def f(row):
x = row['a'] # MaskedType(int64)
y = row['b'] # MaskedType(string_view)
z = y.upper() # MaskedType(udf_string), not used just for demo
return x + len(y)
I always thought the cleanest way of putting this together was to have the string_view
and udf_string
types actually implement the string methods, like upper
or len
, whereas the MaskedType
that carries it around holds the nullability logic. At that point calling something like len(MaskedType(string_view))
is programmed in the lowering to translate roughly to:
len(MaskedType(string_view, valid=True)) = MaskedType(len(string_view), valid=True)
meaning it ends up taking the validity of the source. That would make our strings just another type of scalar that could be masked.
The requirement of the external package made a lot of this way harder though since we needed a lot of this to be optional. Before this PR we had this happening:
- string scalars and their ops defined in
strings_udf
- Masked string operations defined in
cudf
(instrings_typing.py
andstrings_lowering.py
deliberately in a separate place so that it could be optionally imported/registered cudf
optionally importingstrings_typing
andstrings_lowering
which contained all the overloads for masked strings
Now what we have is:
- string scalars and their ops defined in
strings_typing.py
andstrings_lowering.py
- Masked string operations defined in
masked_typing.py
andmasked_lowering.py
- Nothing is optional
I think the second way is a lot cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idle thought, not for implementation in this PR, is there a way we can take advantage of what looks to be the functorial nature of the MaskedType
object to simplify the code and reuse that from strings_typing
/lowering
more systematically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great explanation! That makes sense, and the new way is certainly simpler without the old extra package.
This is quite a large PR. Can we split it into smaller PRs somehow? |
Most of this PR is a pure move, but the pieces are pretty intertwined and it would be hard to break up. The code has been previously reviewed by several folks, so it's all familiar and shouldn't be hard to review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Packaging questions. Changes are probably needed.
- cudf ={{ version }} | ||
- {{ pin_compatible('cudatoolkit', max_pin='x', min_pin='x') }} | ||
- cachetools | ||
- ptxcompiler >=0.7.0 # CUDA enhanced compatibility. See https://github.com/rapidsai/ptxcompiler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't require ptxcompiler >=0.7.0
in the cudf
recipe. We should, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this. Currently this is being inherited through cubinlinker
.
python/cudf/cudf/core/udf/utils.py
Outdated
|
||
|
||
strings_ptx_file = _get_ptx_file(os.path.dirname(__file__), "shim_") | ||
ptx_files.append(strings_ptx_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we just initialize ptx_files
with this value above? We don't need to initialize-and-modify (append).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored this a bit in c304f5c .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Thanks for your iterations, @brandon-b-miller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small change then I think this is good to go.
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving ops-codeowner
file changes
…er/cudf into sunset-stringsudf-package
/merge |
With the merge of #11452 we have the machinery to build and deploy PTX libraries of shim functions as part of cuDF's build process. With this there is no reason to keep the
strings_udf
code separate anymore. This PR removes the separate package and all of it's related CI plumbing as well as supports the strings feature by default, just like GroupBy.