Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support strip, lstrip, and rstrip in strings_udf #12091

Merged
merged 66 commits into from
Nov 10, 2022

Conversation

brandon-b-miller
Copy link
Contributor

This PR adds support for the following three functions in strings_udf:

  • str.strip(other)
  • str.lstrip(other)
  • str.rstrip(other)

Part of #9639

@brandon-b-miller brandon-b-miller added feature request New feature or request 2 - In Progress Currently a work in progress numba Numba issue Python Affects Python cuDF API. non-breaking Non-breaking change labels Nov 8, 2022
@brandon-b-miller brandon-b-miller self-assigned this Nov 8, 2022
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner November 8, 2022 15:47
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor suggestions for improvement. Feel free to merge once you've addressed, I don't need to look again unless something new comes up (in which case feel free to re-request).

@@ -229,6 +229,7 @@ def resolve_count(self, mod):
"isnumeric",
"istitle",
]
string_binary_funcs = ["strip", "lstrip", "rstrip"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is somewhat misleading. There are other string binary functions that we already have implemented, including operators and things like find and contains. Is this list meant to contain binary string functions that also return a string, or is it even more specific than that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming this to string_return_attrs

python/strings_udf/strings_udf/lowering.py Outdated Show resolved Hide resolved
@vyasr
Copy link
Contributor

vyasr commented Nov 8, 2022

Are tests failing because there's somewhere extra that needs to have the new method attributes added?

E                   numba.core.errors.TypingError: Failed in cuda mode pipeline (step: nopython frontend)
E                   #x1B[1m#x1B[1mUnknown attribute 'strip' of type Masked(Poison<#x1B[91m
E                   strings_udf library required for usage of string dtypes inside user defined functions.
E                   #x1B[0m>)
E                   #x1B[1m
E                   File "tests/test_udf_masked_ops.py", line 882:#x1B[0m
E                   #x1B[1m    def func(row):
E                   #x1B[1m        return row["str_col"].strip(strip_char)
E                   #x1B[0m        #x1B[1m^#x1B[0m#x1B[0m
E                   #x1B[0m

@brandon-b-miller
Copy link
Contributor Author

brandon-b-miller commented Nov 9, 2022

Are tests failing because there's somewhere extra that needs to have the new method attributes added?

I think this is just because those tests need to be marked @string_udf_test.

@brandon-b-miller
Copy link
Contributor Author

@davidwendt I'm getting one failing test locally for the following string using:

>>> import cudf
>>> sr = cudf.Series(['.123a'])
>>> def f(st):
...     return st.lstrip('a')
... 
>>> sr.apply(f)
0    .123
dtype: object

It looks like this rstrip'ed the string rather than lstrip'ed it. Interestingly the reverse is also a fail case:

>>> import cudf
>>> sr = cudf.Series(['.123a'])
>>> def f(st):
...     return st.rstrip('.')
... 
>>> sr.apply(f)
0    123a
dtype: object
>>> 

Here it looks like this was lstrip'ed instead of rstrip'ed.

@davidwendt
Copy link
Contributor

davidwendt commented Nov 9, 2022

Here it looks like this was lstrip'ed instead of rstrip'ed.

I'm not seeing this error in my unit tests

  auto input = cudf::string_view(".123a",5);
  auto achr = cudf::string_view("a",1);
  auto dot = cudf::string_view(".",1);

  d_str = cudf::strings::udf::strip( input, achr, cudf::strings::side_type::LEFT );
  printf("lstrip(.123a,a)=[%s]\n", d_str.data());
  d_str = cudf::strings::udf::strip( input, dot, cudf::strings::side_type::LEFT );
  printf("lstrip(.123a,.)=[%s]\n", d_str.data());
  d_str = cudf::strings::udf::strip( input, achr, cudf::strings::side_type::RIGHT );
  printf("rstrip(.123a,a)=[%s]\n", d_str.data());
  d_str = cudf::strings::udf::strip( input, dot, cudf::strings::side_type::RIGHT );
  printf("rstrip(.123a,.)=[%s]\n", d_str.data());
lstrip(.123a,a)=[.123a]
lstrip(.123a,.)=[123a]
rstrip(.123a,a)=[.123]
rstrip(.123a,.)=[.123a]

Could the problem be that all of these are calling just strip() ?

_string_view_strip = cuda.declare_device(
"strip", types.int32(_UDF_STRING_PTR, _STR_VIEW_PTR, _STR_VIEW_PTR)
)
_string_view_lstrip = cuda.declare_device(
"strip", types.int32(_UDF_STRING_PTR, _STR_VIEW_PTR, _STR_VIEW_PTR)
)
_string_view_rstrip = cuda.declare_device(
"strip", types.int32(_UDF_STRING_PTR, _STR_VIEW_PTR, _STR_VIEW_PTR)
)

@brandon-b-miller
Copy link
Contributor Author

@davidwendt you're completely right, with 02167d3 the tests pass.

@codecov
Copy link

codecov bot commented Nov 9, 2022

Codecov Report

Base: 88.05% // Head: 88.08% // Increases project coverage by +0.02% 🎉

Coverage data is based on head (c281108) compared to base (35077f5).
Patch coverage: 93.18% of modified lines in pull request are covered.

❗ Current head c281108 differs from pull request most recent head 3539ed7. Consider uploading reports for the commit 3539ed7 to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #12091      +/-   ##
================================================
+ Coverage         88.05%   88.08%   +0.02%     
================================================
  Files               135      135              
  Lines             22057    22092      +35     
================================================
+ Hits              19423    19459      +36     
+ Misses             2634     2633       -1     
Impacted Files Coverage Δ
python/strings_udf/strings_udf/lowering.py 84.97% <86.36%> (+<0.01%) ⬆️
python/cudf/cudf/core/udf/__init__.py 100.00% <100.00%> (ø)
python/cudf/cudf/core/udf/strings_lowering.py 100.00% <100.00%> (ø)
python/cudf/cudf/core/udf/strings_typing.py 97.26% <100.00%> (+0.07%) ⬆️
python/cudf/cudf/utils/ioutils.py 80.18% <100.00%> (+0.38%) ⬆️
python/strings_udf/strings_udf/_typing.py 97.22% <100.00%> (+0.07%) ⬆️
python/strings_udf/strings_udf/__init__.py 84.31% <0.00%> (+7.84%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@brandon-b-miller brandon-b-miller added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 2 - In Progress Currently a work in progress labels Nov 10, 2022
@brandon-b-miller
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 59bd5c3 into rapidsai:branch-22.12 Nov 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request non-breaking Non-breaking change numba Numba issue Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants