-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate string slice
APIs to pylibcudf
#15988
Merged
rapids-bot
merged 18 commits into
rapidsai:branch-24.08
from
brandon-b-miller:pylibcudf-strings-slice
Jun 26, 2024
Merged
Changes from 9 commits
Commits
Show all changes
18 commits
Select commit
Hold shift + click to select a range
12a6b2c
building
brandon-b-miller 2c33cf7
plumbing to cuDF cython
brandon-b-miller 65e0011
add tests
brandon-b-miller a6e9ed2
excize improper pyarrow function
brandon-b-miller e52079c
column tests
brandon-b-miller d455908
add tests
brandon-b-miller 8e08880
docs
brandon-b-miller 5d223b8
cleanup
brandon-b-miller c53d07a
Merge branch 'branch-24.08' into pylibcudf-strings-slice
galipremsagar 8605bd2
Merge branch 'branch-24.08' into pylibcudf-strings-slice
brandon-b-miller 866fefe
Merge branch 'branch-24.08' into pylibcudf-strings-slice
brandon-b-miller fb1a0a1
address some reviews
brandon-b-miller 238a583
Merge branch 'branch-24.08' into pylibcudf-strings-slice
brandon-b-miller 9d70bb2
Merge branch 'branch-24.08' into pylibcudf-strings-slice
brandon-b-miller 78fe267
slice edge case
brandon-b-miller 8dcdb2c
Apply suggestions from code review
brandon-b-miller dd9fdea
edge cases, tests
brandon-b-miller 82d6569
Merge branch 'branch-24.08' into pylibcudf-strings-slice
brandon-b-miller File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,3 +6,4 @@ strings | |
|
||
contains | ||
replace | ||
slice |
6 changes: 6 additions & 0 deletions
6
docs/cudf/source/user_guide/api_docs/pylibcudf/strings/slice.rst
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
===== | ||
slice | ||
===== | ||
|
||
.. automodule:: cudf._lib.pylibcudf.strings.slice | ||
:members: |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,4 +9,5 @@ from . cimport ( | |
regex_flags, | ||
regex_program, | ||
replace, | ||
slice, | ||
) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,4 +9,5 @@ | |
regex_flags, | ||
regex_program, | ||
replace, | ||
slice, | ||
) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
# Copyright (c) 2024, NVIDIA CORPORATION. | ||
|
||
from cudf._lib.pylibcudf.column cimport Column | ||
from cudf._lib.pylibcudf.scalar cimport Scalar | ||
|
||
ctypedef fused ColumnOrScalar: | ||
Column | ||
Scalar | ||
|
||
cpdef Column slice_strings( | ||
Column input, | ||
ColumnOrScalar start=*, | ||
ColumnOrScalar stop=*, | ||
Scalar step=* | ||
) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,94 @@ | ||
# Copyright (c) 2024, NVIDIA CORPORATION. | ||
|
||
from libcpp.memory cimport unique_ptr | ||
from libcpp.utility cimport move | ||
|
||
from cudf._lib.pylibcudf.column cimport Column | ||
from cudf._lib.pylibcudf.libcudf.column.column cimport column | ||
from cudf._lib.pylibcudf.libcudf.scalar.scalar cimport numeric_scalar | ||
from cudf._lib.pylibcudf.libcudf.scalar.scalar_factories cimport ( | ||
make_fixed_width_scalar as cpp_make_fixed_width_scalar, | ||
) | ||
from cudf._lib.pylibcudf.libcudf.strings cimport substring as cpp_slice | ||
from cudf._lib.pylibcudf.libcudf.types cimport size_type | ||
from cudf._lib.pylibcudf.scalar cimport Scalar | ||
|
||
from cython.operator import dereference | ||
|
||
|
||
cpdef Column slice_strings( | ||
Column input, | ||
wence- marked this conversation as resolved.
Show resolved
Hide resolved
|
||
ColumnOrScalar start=None, | ||
ColumnOrScalar stop=None, | ||
wence- marked this conversation as resolved.
Show resolved
Hide resolved
|
||
Scalar step=None | ||
): | ||
"""Perform a slice operation on a strings column. | ||
|
||
``start`` and ``stop`` may be a | ||
:py:class:`~cudf._lib.pylibcudf.column.Column` or a | ||
:py:class:`~cudf._lib.pylibcudf.scalar.Scalar`. But ``step`` must be a | ||
:py:class:`~cudf._lib.pylibcudf.scalar.Scalar`. | ||
|
||
For details, see :cpp:func:`cudf::strings::slice_strings`. | ||
|
||
Parameters | ||
---------- | ||
input : Column | ||
Strings column for this operation | ||
start : Union[Column, Scalar] | ||
The start character position or positions. | ||
stop : Union[Column, Scalar] | ||
The end character position or positions | ||
step : Scalar | ||
Distance between input characters retrieved | ||
|
||
Returns | ||
------- | ||
pylibcudf.Column | ||
The result of the slice operation | ||
""" | ||
cdef unique_ptr[column] c_result | ||
cdef numeric_scalar[size_type]* cpp_start | ||
cdef numeric_scalar[size_type]* cpp_stop | ||
cdef numeric_scalar[size_type]* cpp_step | ||
|
||
if ColumnOrScalar is Column: | ||
if step is not None: | ||
raise ValueError("Column-wise slice does not support step") | ||
|
||
with nogil: | ||
c_result = cpp_slice.slice_strings( | ||
input.view(), | ||
start.view(), | ||
stop.view() | ||
brandon-b-miller marked this conversation as resolved.
Show resolved
Hide resolved
|
||
) | ||
|
||
elif ColumnOrScalar is Scalar: | ||
if start is None: | ||
delimiters = Scalar.from_libcudf( | ||
cpp_make_fixed_width_scalar(0) | ||
) | ||
if stop is None: | ||
delimiters = Scalar.from_libcudf( | ||
cpp_make_fixed_width_scalar(0) | ||
) | ||
if step is None: | ||
delimiters = Scalar.from_libcudf( | ||
cpp_make_fixed_width_scalar(1) | ||
) | ||
brandon-b-miller marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
cpp_start = <numeric_scalar[size_type]*>start.c_obj.get() | ||
cpp_stop = <numeric_scalar[size_type]*>stop.c_obj.get() | ||
cpp_step = <numeric_scalar[size_type]*>step.c_obj.get() | ||
|
||
brandon-b-miller marked this conversation as resolved.
Show resolved
Hide resolved
|
||
with nogil: | ||
c_result = cpp_slice.slice_strings( | ||
input.view(), | ||
dereference(cpp_start), | ||
dereference(cpp_stop), | ||
dereference(cpp_step) | ||
) | ||
else: | ||
raise ValueError("start, stop, and step must be either Column or Scalar") | ||
|
||
return Column.from_libcudf(move(c_result)) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
# Copyright (c) 2024, NVIDIA CORPORATION. | ||
|
||
import pyarrow as pa | ||
import pytest | ||
from utils import assert_column_eq | ||
|
||
import cudf._lib.pylibcudf as plc | ||
|
||
|
||
@pytest.fixture(scope="module") | ||
def pa_col(): | ||
return pa.array(["AbC", "123abc", "", " ", None]) | ||
|
||
|
||
@pytest.fixture(scope="module") | ||
def plc_col(pa_col): | ||
return plc.interop.from_arrow(pa_col) | ||
|
||
|
||
@pytest.fixture( | ||
scope="module", | ||
params=[(1, 3, 1), (0, 3, -1), (3, 2, 1), (1, 5, 5), (1, 100, 2)], | ||
) | ||
def pa_start_stop_step(request): | ||
return tuple(pa.scalar(x, type=pa.int32()) for x in request.param) | ||
|
||
|
||
@pytest.fixture(scope="module") | ||
def plc_start_stop_step(pa_start_stop_step): | ||
return tuple(plc.interop.from_arrow(x) for x in pa_start_stop_step) | ||
|
||
|
||
@pytest.fixture(scope="module") | ||
def pa_starts_col(): | ||
return pa.array([0, 1, 3, -1, 100]) | ||
|
||
|
||
@pytest.fixture(scope="module") | ||
def plc_starts_col(pa_starts_col): | ||
return plc.interop.from_arrow(pa_starts_col) | ||
|
||
|
||
@pytest.fixture(scope="module") | ||
def pa_stops_col(): | ||
return pa.array([1, 3, 4, -1, 100]) | ||
|
||
|
||
@pytest.fixture(scope="module") | ||
def plc_stops_col(pa_stops_col): | ||
return plc.interop.from_arrow(pa_stops_col) | ||
|
||
|
||
def test_slice(pa_col, plc_col, pa_start_stop_step, plc_start_stop_step): | ||
pa_start, pa_stop, pa_step = pa_start_stop_step | ||
plc_start, plc_stop, plc_step = plc_start_stop_step | ||
|
||
def slice_string(st, start, stop, step): | ||
return st[start:stop:step] if st is not None else None | ||
|
||
expected = pa.array( | ||
[ | ||
slice_string(x, pa_start.as_py(), pa_stop.as_py(), pa_step.as_py()) | ||
for x in pa_col.to_pylist() | ||
], | ||
type=pa.string(), | ||
) | ||
|
||
got = plc.strings.slice.slice_strings( | ||
plc_col, start=plc_start, stop=plc_stop, step=plc_step | ||
) | ||
|
||
assert_column_eq(expected, got) | ||
|
||
|
||
def test_slice_column( | ||
pa_col, plc_col, pa_starts_col, plc_starts_col, pa_stops_col, plc_stops_col | ||
): | ||
def slice_string(st, start, stop): | ||
return st[start:stop] if st is not None else None | ||
|
||
expected = pa.array( | ||
[ | ||
slice_string(x, start, stop) | ||
for x, start, stop in zip( | ||
pa_col.to_pylist(), | ||
pa_starts_col.to_pylist(), | ||
pa_stops_col.to_pylist(), | ||
) | ||
], | ||
type=pa.string(), | ||
) | ||
|
||
got = plc.strings.slice.slice_strings( | ||
plc_col, plc_starts_col, plc_stops_col | ||
) | ||
|
||
assert_column_eq(expected, got) |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking this PR, but wonder if it makes sense to align the libcudf/pylibcudf structure on either
substring
orslice
rather than using both interchangeably (though don't have much context with strings submodules to know if this was an intentional choice)