Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate string replace.pxd to pylibcudf #15839

Merged
merged 14 commits into from
Jun 5, 2024

Conversation

lithomas1
Copy link
Contributor

@lithomas1 lithomas1 commented May 23, 2024

Description

xref #15162

Change replace.pxd to use pylibcudf APIs.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added Python Affects Python cuDF API. CMake CMake build issue labels May 23, 2024
set(linked_libraries cudf::cudf)
rapids_cython_create_modules(
CXX
SOURCE_FILES "${cython_sources}"
LINKED_LIBRARIES "${linked_libraries}" MODULE_PREFIX pylibcudf_ ASSOCIATED_TARGETS cudf
LINKED_LIBRARIES "${linked_libraries}" MODULE_PREFIX pylibcudf_strings ASSOCIATED_TARGETS cudf
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed since the string replace.pyx clashes with the regular replace.pyx

))
else:
# Column case
# TODO: maxrepl should be supported in the corresponding CUDA/C++ code
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the overload of replace in libcudf where input/target/repl are columns, there isn't a maxrepl arg.

We should probably support this in libcudf replace (eventually), otherwise we'll have some weirdness in pylibcudf where we'll have to raise for maxrepl despite accepting it as an argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Can you raise an issue?

In the meantime, I would recommend that we change the default value of the parameter to None, then raise a NotImplementedError in this branch of the code if we find a non-None value, while in the Scalar branch we set it to -1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lithomas1 lithomas1 marked this pull request as ready for review May 23, 2024 19:00
@lithomas1 lithomas1 requested a review from a team as a code owner May 23, 2024 19:00
@lithomas1
Copy link
Contributor Author

probably should go in after #15503

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some very minor nits to pick, but overall looks great!

python/cudf/cudf/_lib/pylibcudf/strings/CMakeLists.txt Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/pylibcudf/strings/find.pxd Outdated Show resolved Hide resolved
))
else:
# Column case
# TODO: maxrepl should be supported in the corresponding CUDA/C++ code
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Can you raise an issue?

In the meantime, I would recommend that we change the default value of the parameter to None, then raise a NotImplementedError in this branch of the code if we find a non-None value, while in the Scalar branch we set it to -1.

python/cudf/cudf/_lib/pylibcudf/strings/replace.pyx Outdated Show resolved Hide resolved
@pytest.fixture(scope="module", params=["a", "c", "A", "Á", "aa", "ÁÁÁ"])
def scalar_repl_target(request):
pa_target = pa.scalar(request.param, type=pa.string())
return (request.param, plc.interop.from_arrow(pa_target))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting approach. This accomplishes the same thing in spirit that the other tests have done with a plc fixture that's constructed from the arrow fixture, just in a slightly different way. The more I stare at it, I'm starting to prefer this approach a little since it ties the two objects very tightly together in a way that reflects that they'll probably always be used together in tests. I suppose it's largely a matter of taste though. @mroeschke @brandon-b-miller WDYT? I would generally prefer to be consistent across tests; it's not a huge deal, but given that our test suite is still fairly small I'd be fine with a quick PR to standardize on this approach if we like it better (or change this to the separate fixture approach if we prefer that).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what we do in pandas.

The advantage of this approach is being able to pair the values together.
(which is good since you'll never have cases where the fixtures get out of sync)

The disadvantage is that you have to add a line unpacking the tuple in every test that uses the fixture.

I'm happy with either, just did it this way since it was more familiar for me.

Copy link
Contributor

@vyasr vyasr Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think I prefer this approach. After this PR merges, could you make a follow-up that standardizes this in other tests? Basically just removing pairs of (pyarrow,pylibcudf) fixtures in favor of a single fixture returning the pair? You can also pair up the column fixtures in this test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@vyasr
Copy link
Contributor

vyasr commented May 28, 2024

I'm holding off on a second review until #15503 is merged, but in the meantime could you add a short PR description please?

@vyasr vyasr added the pylibcudf Issues specific to the pylibcudf package label May 28, 2024
@lithomas1
Copy link
Contributor Author

Lets wait for #15898

@lithomas1 lithomas1 added the 5 - DO NOT MERGE Hold off on merging; see PR for details label May 31, 2024
@vyasr
Copy link
Contributor

vyasr commented Jun 3, 2024

#15898 is merged

@lithomas1 lithomas1 removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Jun 3, 2024
@lithomas1
Copy link
Contributor Author

OK, this is re-sync'ed and ready for re-review.

@lithomas1 lithomas1 requested a review from vyasr June 4, 2024 20:56
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of small suggestions and a request to propagate the change for testing fixtures forward. Otherwise LGTM!

docs/cudf/source/user_guide/api_docs/pylibcudf/index.rst Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/pylibcudf/strings/replace.pyx Outdated Show resolved Hide resolved
@lithomas1
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit db1b365 into rapidsai:branch-24.08 Jun 5, 2024
69 checks passed
@lithomas1 lithomas1 deleted the pylibcudf-replace branch June 5, 2024 16:48
@lithomas1 lithomas1 mentioned this pull request Jun 8, 2024
3 tasks
rapids-bot bot pushed a commit that referenced this pull request Jun 11, 2024
Condense all pa_foo/plc_foo data fixtures into just foo, as recommended by
#15839 (comment).

Authors:
  - Thomas Li (https://github.com/lithomas1)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #15958
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue feature request New feature or request non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants