Migrate string `slice` APIs to pylibcudf #15988
Thanks @brandon-b-miller! Some small drive-by comments for now:
from cudf._lib.pylibcudf.libcudf.scalar.scalar_factories cimport (
    make_fixed_width_scalar as cpp_make_fixed_width_scalar,
)
from cudf._lib.pylibcudf.libcudf.strings cimport substring as cpp_slice
Not blocking this PR, but I wonder if it makes sense to align the libcudf/pylibcudf structure on either `substring` or `slice` rather than using both interchangeably (though I don't have enough context on the strings submodules to know whether this was an intentional choice).
There's a failing test here that could be coming from libcudf - I get different answers for a string with just one space in it depending on whether I use the column-wise or scalar APIs:

import pyarrow as pa
import cudf._lib.pylibcudf as plc

# A single-row strings column containing one space
col = plc.interop.from_arrow(pa.array([" "]))

# Scalar start/stop
slr_start = plc.interop.from_arrow(pa.scalar(-1))
slr_stop = plc.interop.from_arrow(pa.scalar(-1))
res = plc.strings.slice.slice_strings(col, slr_start, slr_stop)
print(plc.interop.to_arrow(res))

# Column start/stop with the same values
col_start = plc.interop.from_arrow(pa.array([-1]))
col_stop = plc.interop.from_arrow(pa.array([-1]))
res = plc.strings.slice.slice_strings(col, col_start, col_stop)
print(plc.interop.to_arrow(res))

This prints
Should we expect to get a different result for
cc maybe @davidwendt for the above
The scalar version does operate differently from the columnar version when negative values are involved. Negative values in the scalar version wrap around, to be consistent with pandas string slicing.
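For readers unfamiliar with the wrap-around convention mentioned above, here is a minimal pure-Python sketch of it. It only illustrates the Python/pandas slicing semantics that the scalar API is said to follow; the helper names `wrap_negative` and `slice_with_wraparound` are hypothetical and are not part of libcudf or pylibcudf, and the sketch says nothing about how the columnar overload treats negative values.

```python
def wrap_negative(index, length):
    """Count a negative index from the end of the string, then clamp to [0, length]."""
    if index < 0:
        index += length
    return max(0, min(index, length))


def slice_with_wraparound(s, start, stop):
    """Slice `s` using explicit wrap-around of negative start/stop (hypothetical helper)."""
    n = len(s)
    return s[wrap_negative(start, n):wrap_negative(stop, n)]


# The explicit wrap-around agrees with Python's built-in slicing,
# including the single-space case from the reproducer above.
for s, start, stop in [(" ", -1, -1), ("hello", -3, -1), ("hello", 1, -1)]:
    assert slice_with_wraparound(s, start, stop) == s[start:stop]
```

Under these semantics, slicing the one-space string with start and stop both at -1 yields an empty string, because both bounds wrap to position 0.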
Should we update the docs in this PR? That seems like the main action item if the test behavior is correct.
Thanks @davidwendt for the context. I managed to handle the edge case in 78fe267. All tests should pass here now.
Some small queries
Co-authored-by: Lawrence Mitchell <[email protected]>
Thanks Brandon
/merge
This PR plumbs the libcudf/pylibcudf `slice_strings` function through to cudf-polars.
Depends on #15988
Authors:
- https://github.com/brandon-b-miller
- GALI PREM SAGAR (https://github.com/galipremsagar)
Approvers:
- Lawrence Mitchell (https://github.com/wence-)
URL: #16082
This PR introduces pylibcudf string `slice` APIs and migrates the cuDF Cython to use them. Part of #15162