[FEA] Ability to control the amount of temporary memory used for regex expressions #10852

Open
jlowe opened this issue May 13, 2022 · 6 comments
Labels: feature request, libcudf, strings

Comments

@jlowe
Member

jlowe commented May 13, 2022

Is your feature request related to a problem? Please describe.
Regular expression processing can require a significant amount of temporary memory. The RAPIDS Accelerator for Apache Spark needs the ability to control how much GPU memory is used for these operations in order to avoid excessive spilling or GPU out of memory errors when the user provides a particularly complicated regex pattern and/or large input data.

Describe the solution you'd like
The libcudf regular expression APIs would accept an optional parameter specifying an upper bound on the amount of temporary GPU memory to use for regular expression processing. If the value is below the "natural" size for full concurrency, the algorithm would reduce its concurrency to fit within the memory bound. I would expect a lower limit below which regex processing is not possible within the requested memory bound.
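
A minimal sketch of the trade-off described above, assuming the temporary memory scales linearly with the number of strings processed concurrently. The names `regex_batch_plan` and `plan_regex_batches`, and the per-string state size parameter, are hypothetical and are not part of libcudf; a real implementation would derive the per-string size from the compiled pattern.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical result type: how many strings to process at once and in how many passes.
struct regex_batch_plan {
  std::size_t strings_per_pass;
  std::size_t num_passes;
};

// Hypothetical helper (not a libcudf API).
// num_strings:            rows in the input strings column
// state_bytes_per_string: temporary memory needed per in-flight string (assumed pattern-dependent)
// memory_limit:           caller-supplied upper bound on temporary memory, in bytes
// Returns std::nullopt when the limit is below the minimum needed to process one string.
inline std::optional<regex_batch_plan> plan_regex_batches(std::size_t num_strings,
                                                          std::size_t state_bytes_per_string,
                                                          std::size_t memory_limit)
{
  assert(state_bytes_per_string > 0);
  if (num_strings == 0) { return regex_batch_plan{0, 0}; }

  // Lower limit: with less than one string's worth of state, no processing is possible.
  if (memory_limit < state_bytes_per_string) { return std::nullopt; }

  // Reduce concurrency so that the in-flight state fits within the bound.
  std::size_t const max_concurrent = memory_limit / state_bytes_per_string;
  std::size_t const per_pass       = std::min(num_strings, max_concurrent);
  std::size_t const passes         = (num_strings + per_pass - 1) / per_pass;  // ceiling division
  return regex_batch_plan{per_pass, passes};
}
```

Under these assumptions, a limit at or above the "natural" size gives a single pass at full concurrency, a smaller limit trades more passes for less memory, and a limit below one string's state is rejected.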

Describe alternatives you've considered
Instead of APIs focused on limiting memory, there could be APIs that report how much memory will be used without the ability to control it, such as the one implemented in #10808. This type of API does not allow the caller to trade off GPU memory usage against GPU performance: the operation either fits in GPU memory or it doesn't. If the reported size is too big, the RAPIDS Accelerator would be forced to fall back to the CPU to perform the regex processing (with the requisite transform from columnar to row-formatted data and back).

The RAPIDS Accelerator currently does not support falling back to the CPU after query planning has completed on the Spark driver (which does not have a GPU), and query planning does not have access to the string data to search (only the regex pattern to use). Even with a memory size reporting API, without the input data the API could only return a worst-case estimate, which could cause an unnecessary fallback to the CPU.

@jlowe jlowe added feature request New feature or request Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels May 13, 2022
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@revans2
Contributor

revans2 commented Jun 14, 2022

This is still wanted

@davidwendt
Contributor

@revans2 And what about this one? #10808 (comment)

@GregoryKimball
Contributor

Closing after discussion in #10808

@davidwendt
Contributor

#10808 was the API that I built, which turned out not to be useful. This issue describes the API that Spark actually wants.

@davidwendt davidwendt reopened this Jun 28, 2022
@GregoryKimball GregoryKimball removed the Needs Triage Need team to review and classify label Jun 29, 2022
@davidwendt davidwendt self-assigned this Jul 15, 2022

rapids-bot bot pushed a commit that referenced this issue Nov 9, 2022
Adds a new `regex_program` class to encapsulate a regex pattern and parameters used for executing regex calls on strings columns in libcudf. This provides a single object to hold the regex settings rather than adding or updating parameters to every call. Given a pattern (and other settings), it will _compile_ and validate the pattern and build the set of instructions/commands needed to execute the regex on a strings column. Converting the pattern is done in CPU code. The object contains no state data and can be reused on the same API or other similar calls as appropriate (per the settings).
The object can also be queried to help with resource allocation/expectations.

The main files to review are the new `regex_program*` source files plus the corresponding changes in `regexec.cpp` (renamed from .cu). The remainder are simply side-effects and have common patterns to use the new object.
No function or behavior has changed; rather, a new interface has been added over existing functionality, and additional tests have been added to exercise it through the companion APIs.

Currently, all regex APIs are duplicated -- the original API plus a new one accepting a `regex_program` object. Once accepted, we may consider deprecating the non-object APIs and then removing them in a future release.

This will help with changes needed for #10852
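
The "compile once, reuse across calls, query for resource expectations" pattern described in this commit message can be illustrated with a toy CPU-only stand-in built on `std::regex`. This is not libcudf's `regex_program`: the class `toy_regex_program`, the functions `toy_contains` and `toy_count`, and the `estimated_working_bytes` formula are all hypothetical and exist only to show the shape of the design.

```cpp
#include <cstddef>
#include <iterator>
#include <memory>
#include <regex>
#include <string>
#include <vector>

// Toy stand-in for a reusable compiled-pattern object (hypothetical, not libcudf).
class toy_regex_program {
 public:
  // Compiles and validates the pattern on the CPU; throws std::regex_error if it is invalid.
  static std::unique_ptr<toy_regex_program> create(std::string const& pattern)
  {
    return std::unique_ptr<toy_regex_program>(new toy_regex_program(pattern));
  }

  std::string const& pattern() const { return pattern_; }
  std::regex const& compiled() const { return compiled_; }

  // Made-up estimate for illustration only: assumes each in-flight string needs
  // one int of state per pattern character.
  std::size_t estimated_working_bytes(std::size_t num_strings) const
  {
    return num_strings * pattern_.size() * sizeof(int);
  }

 private:
  explicit toy_regex_program(std::string const& pattern) : pattern_(pattern), compiled_(pattern) {}

  std::string pattern_;
  std::regex compiled_;
};

// Two different "APIs" accept the same program object instead of a raw pattern string.
inline std::vector<bool> toy_contains(std::vector<std::string> const& input,
                                      toy_regex_program const& prog)
{
  std::vector<bool> out;
  out.reserve(input.size());
  for (auto const& s : input) { out.push_back(std::regex_search(s, prog.compiled())); }
  return out;
}

inline std::vector<std::size_t> toy_count(std::vector<std::string> const& input,
                                          toy_regex_program const& prog)
{
  std::vector<std::size_t> out;
  out.reserve(input.size());
  for (auto const& s : input) {
    auto const first = std::sregex_iterator(s.begin(), s.end(), prog.compiled());
    out.push_back(static_cast<std::size_t>(std::distance(first, std::sregex_iterator())));
  }
  return out;
}
```

Because the object holds the settings, a new option such as a memory bound could be attached to it without changing the signature of every call that accepts it.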

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)

URL: #11927