Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Enable .str.find_multiple public API #10126

Closed
galipremsagar opened this issue Jan 25, 2022 · 8 comments · Fixed by #11928
Closed

[FEA] Enable .str.find_multiple public API #10126

galipremsagar opened this issue Jan 25, 2022 · 8 comments · Fixed by #11928
Assignees
Labels
feature request New feature or request Python Affects Python cuDF API. strings strings issues (C++ and Python)

Comments

@galipremsagar
Copy link
Contributor

Is your feature request related to a problem? Please describe.
We currently have cudf._lib.strings.find_multiple which is an internal API. We would like to expose it via a public API similar to what we wanted to do in: #4575, this was previously dropped due to lack of ListColumn support: #4569 (comment), which shouldn't be a blocker now.

@galipremsagar galipremsagar added feature request New feature or request Python Affects Python cuDF API. strings strings issues (C++ and Python) labels Jan 25, 2022
@galipremsagar galipremsagar self-assigned this Jan 25, 2022
@bdice
Copy link
Contributor

bdice commented Jan 25, 2022

Additional context (@galipremsagar beat me to filing an issue! 😉):

The gpu-bdb query 18 code uses this private API, cudf._lib.strings.find_multiple, to search a column of string values for a list of target values:
https://github.com/rapidsai/gpu-bdb/blob/f67ea68d4b422c998d0284cc69fc82ed371c69cc/gpu_bdb/queries/q18/gpu_bdb_query_18.py#L106-L126

Describe the solution you'd like
Query 18 should be able to call a public API Series.str.find_multiple that behaves like Series.str.find, rather than a private API. The implementation in libcudf can be refactored to return a list column of integers (previously proposed by @davidwendt here: #6341 (comment)).

@davidwendt
Copy link
Contributor

davidwendt commented Jan 26, 2022

Technically you can build the lists column in the Python layer by just creating an offsets column and using it along with column returned by the find_multiple API.

s = cudf.Series([ "abc", "bcd", "c", "de" ])
t = cudf.Series(["c", "d"])
result = s.str.find_multiple(t)
# result is now [ 2, -1,   1, 2   0, -1   -1, 0 ]

offsets = [ 0, 2, 4, 6, 8 ]  # just an even sequence  [0 : len(s)*len(t) : len(t)]
# likely there is a numpy way to create this array

Build a ListColumn by using offsets and results as its child columns.

@harrism harrism changed the title [FEA] Enable .str.find_muliple public API [FEA] Enable .str.find_multiple public API Jan 26, 2022
rapids-bot bot pushed a commit that referenced this issue Jan 28, 2022
Reference #10126 

This changes the current `cudf::strings::find_multiple` API to return a lists column instead of a flattened matrix as a column of integers. Each lists column size is equal to the size of the input strings and each row's size is equal to input targets column size.

Making this a breaking change since it changes the result of a public libcudf API.

Authors:
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Robert Maynard (https://github.com/robertmaynard)
  - Nghia Truong (https://github.com/ttnghia)

URL: #10134
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@bdice
Copy link
Contributor

bdice commented Feb 28, 2022

This API is also a candidate for a separate NLP-focused submodule (which has been discussed among cudf developers a few times but I am not aware of an issue tracking this idea). The pandas API does not have an equivalent of Series.str.find_multiple.

@github-actions
Copy link

github-actions bot commented Apr 1, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@GregoryKimball
Copy link
Contributor

Closed by #10134

@bdice
Copy link
Contributor

bdice commented Jul 2, 2022

@GregoryKimball Re-opening this one because it’s only partially complete. The proposed Python API has not yet been added, only the C++ changes have been made. See: #10134 (review)

@bdice bdice reopened this Jul 2, 2022
rapids-bot bot pushed a commit that referenced this issue Oct 18, 2022
Resolves: #10126

This PR adds `.str.find_multiple` API.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Bradley Dice (https://github.com/bdice)

URL: #11928
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Python Affects Python cuDF API. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants