Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate string slice APIs to pylibcudf #15988

Merged

Conversation

brandon-b-miller
Copy link
Contributor

This PR introduces pylibcudf string slice APIs and migrates the cuDF cython to use them. Part of #15162

@brandon-b-miller brandon-b-miller requested a review from a team as a code owner June 12, 2024 02:37
@github-actions github-actions bot added Python Affects Python cuDF API. CMake CMake build issue pylibcudf Issues specific to the pylibcudf package labels Jun 12, 2024
@brandon-b-miller brandon-b-miller added feature request New feature or request non-breaking Non-breaking change labels Jun 12, 2024
Copy link
Member

@charlesbluca charlesbluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @brandon-b-miller! Some small drive-by comments for now:

python/cudf/cudf/_lib/pylibcudf/strings/slice.pyx Outdated Show resolved Hide resolved
from cudf._lib.pylibcudf.libcudf.scalar.scalar_factories cimport (
make_fixed_width_scalar as cpp_make_fixed_width_scalar,
)
from cudf._lib.pylibcudf.libcudf.strings cimport substring as cpp_slice
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking this PR, but wonder if it makes sense to align the libcudf/pylibcudf structure on either substring or slice rather than using both interchangeably (though don't have much context with strings submodules to know if this was an intentional choice)

python/cudf/cudf/_lib/strings/substring.pyx Show resolved Hide resolved
@brandon-b-miller
Copy link
Contributor Author

There's a failing test here that could be coming from libcudf - I get different answers for a string with just one space in it depending on if I use the column wise or scalar APIs:

import pyarrow as pa
import cudf._lib.pylibcudf as plc

col = plc.interop.from_arrow(
    pa.array(
        [
            " "
        ]
    )
)

slr_start = plc.interop.from_arrow(pa.scalar(-1))
slr_stop = plc.interop.from_arrow(pa.scalar(-1))

res = plc.strings.slice.slice_strings(col, slr_start, slr_stop)
print(plc.interop.to_arrow(res))

col_start = plc.interop.from_arrow(
    pa.array(
        [
            -1
        ]
    )
)

col_stop = plc.interop.from_arrow(
    pa.array(
        [
            -1
        ]
    )
)

res = plc.strings.slice.slice_strings(col, col_start, col_stop)

print(plc.interop.to_arrow(res))

This prints

[
  [
    ""
  ]
]
[
  [
    " "
  ]
]

Should we expect do get a different result for slice(" ", start=-1, stop=-1) depending on if the -1's come from a scalar or column?

@brandon-b-miller
Copy link
Contributor Author

Should we expect do get a different result for slice(" ", start=-1, stop=-1) depending on if the -1's come from a scalar or column?

cc maybe @davidwendt for the above

@davidwendt
Copy link
Contributor

davidwendt commented Jun 24, 2024

The scalar version does operate different from the columnar version when negative values are involved.
Documentation here does not describe this: https://docs.rapids.ai/api/libcudf/stable/group__strings__slice#gaa6f853a3b807391dadad69ad8727bac0

Negative values for the scalar version do a wrap around to be consistent with the pandas string slicing.
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.slice.html
The columnar version does not support negative values in this way. You can pass negative values for the stops column to indicate to-the-end-of-the-string. Otherwise, out-of-bounds position values will result in an empty string.

@vyasr
Copy link
Contributor

vyasr commented Jun 24, 2024

Should we update the docs in this PR? That seems like the main action item if the test behavior is correct.

@brandon-b-miller
Copy link
Contributor Author

Thanks @davidwendt for the context I managed to handle the edge case in 78fe267. All tests should pass here now.

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small queries

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Brandon

@brandon-b-miller
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit e0b8ab0 into rapidsai:branch-24.08 Jun 26, 2024
81 checks passed
@brandon-b-miller brandon-b-miller deleted the pylibcudf-strings-slice branch June 26, 2024 23:07
rapids-bot bot pushed a commit that referenced this pull request Jul 3, 2024
This PR plumbs the libcudf/pylibcudf `slice_strings` function through to cudf-polars. Depends on #15988

Authors:
  - https://github.com/brandon-b-miller
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #16082
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue feature request New feature or request non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

6 participants