[FEA] Ability to control the amount of temporary memory used for regex expressions #10852

jlowe · 2022-05-13T19:12:00Z

Is your feature request related to a problem? Please describe.
Regular expression processing can require a significant amount of temporary memory. The RAPIDS Accelerator for Apache Spark needs the ability to control how much GPU memory is used for these operations in order to avoid excessive spilling or GPU out-of-memory errors when the user provides a particularly complicated regex pattern and/or large input data.

Describe the solution you'd like
The libcudf regular expression APIs would accept an optional parameter specifying an upper bound on the amount of temporary GPU memory to use for regular expression processing. If the value is below the "natural" size for full concurrency, the algorithm would reduce the concurrency to fit within the memory bound. I would expect there to be a lower limit below which regex processing would not be possible within the requested memory limit.
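To make the requested behavior concrete, here is a minimal sketch of how a memory bound could translate into reduced concurrency. It is not libcudf code: `plan_regex_batches` and its parameters are hypothetical, and it assumes the temporary memory scales linearly with the number of strings processed concurrently.

```python
import math


def plan_regex_batches(num_strings, state_bytes_per_string, memory_limit_bytes=None):
    """Return (strings_per_batch, num_batches) for a memory-bounded regex call.

    Hypothetical helper: assumes each concurrently processed string needs
    `state_bytes_per_string` bytes of temporary memory.
    """
    if state_bytes_per_string <= 0:
        raise ValueError("state_bytes_per_string must be positive")
    if num_strings <= 0:
        return 0, 0

    # The "natural" size: every string processed concurrently.
    natural_bytes = num_strings * state_bytes_per_string
    if memory_limit_bytes is None or memory_limit_bytes >= natural_bytes:
        return num_strings, 1

    # Reduce concurrency until one batch fits within the bound.
    strings_per_batch = memory_limit_bytes // state_bytes_per_string
    if strings_per_batch == 0:
        # Lower limit: not even one string's state fits in the bound.
        raise ValueError("memory limit is below the minimum needed for regex processing")

    return strings_per_batch, math.ceil(num_strings / strings_per_batch)
```

With no limit, or a limit at or above the natural size, the whole column runs in one pass. A smaller limit yields more, smaller batches, trading GPU performance for memory, and a limit below the per-string state is rejected.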

Describe alternatives you've considered
Instead of APIs focused on limiting memory, there could be APIs that report what will be used without the ability to control it, such as the one implemented in #10808. This type of API does not allow the caller to trade off GPU memory usage against GPU performance, as the operation either will fit in GPU memory or it won't. If the reported size is too big, the RAPIDS Accelerator would be forced to fall back to the CPU to perform the regex processing (with the requisite transform from columnar to row-formatted data and back).

The RAPIDS Accelerator currently does not support falling back to the CPU after query planning has completed on the Spark driver (which does not have a GPU), and the query planning does not have access to the string data to search (only the regex pattern to use). Even with a memory size reporting API, without the input data the API would have to be a worst-case estimate that could cause an unnecessary fallback to the CPU.

github-actions · 2022-06-12T20:03:41Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

revans2 · 2022-06-14T14:32:21Z

This is still wanted

davidwendt · 2022-06-14T14:45:05Z

@revans2 And what about this one? #10808 (comment)

GregoryKimball · 2022-06-28T05:32:12Z

Closing after discussion in #10808

davidwendt · 2022-06-28T12:20:16Z

#10808 was the API that I built, and it was not useful. This feature request describes the API that Spark actually wants.

github-actions · 2022-08-14T21:03:14Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Adds a new `regex_program` class to encapsulate a regex pattern and the parameters used for executing regex calls on strings columns in libcudf. This provides a single object to hold the regex settings rather than adding or updating parameters on every call.

Given a pattern (and other settings), it will _compile_ and validate the pattern and build the set of instructions/commands needed to execute the regex on a strings column. Converting the pattern is done in CPU code. The object contains no state data and can be reused on the same API or other similar calls as appropriate (per the settings). The object can also be queried to help with resource allocation/expectations.

The main files to review are the new `regex_program*` source files plus the corresponding changes in `regexec.cpp` (renamed from .cu). The remainder are simply side effects and follow common patterns to use the new object. No function or behavior has changed; rather, a new interface has been added over existing functionality, and additional tests have been added to exercise it through the companion APIs.

Currently, all regex APIs are duplicated: the original API plus a new one accepting a `regex_program` object. Once accepted, we may consider deprecating the non-object APIs and then removing them in a future release.

This will help with changes needed for #10852

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Vyas Ramasubramani (https://github.com/vyasr)
- Bradley Dice (https://github.com/bdice)
- Ray Douglass (https://github.com/raydouglass)

URL: #11927
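The design idea (compile and validate once on the CPU, then pass one object to several calls) can be illustrated with a plain-Python analogy. This is not the libcudf interface: the `RegexProgram` class and the list-based `contains_re` and `count_re` helpers below are hypothetical stand-ins built on Python's `re` module.

```python
import re


class RegexProgram:
    """Hypothetical analog of a compiled-pattern object: pattern plus settings."""

    def __init__(self, pattern, flags=0):
        self.pattern = pattern
        self.flags = flags
        # Compile and validate up front; an invalid pattern raises re.error here.
        self._compiled = re.compile(pattern, flags)


def contains_re(strings, prog):
    """Return True for each string that contains a match for the program."""
    return [prog._compiled.search(s) is not None for s in strings]


def count_re(strings, prog):
    """Return the number of non-overlapping matches in each string."""
    return [len(prog._compiled.findall(s)) for s in strings]


# One program object reused across two different calls.
prog = RegexProgram(r"\d+")
data = ["abc", "a1", "22 33"]
has_digits = contains_re(data, prog)
digit_runs = count_re(data, prog)
```

Because the settings live on the object, a new option such as a memory bound could be added in one place instead of on every function signature.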

jlowe added the feature request, Needs Triage, libcudf, and strings labels May 13, 2022

jlowe mentioned this issue May 13, 2022

Add cudf::strings::compute_regex_state_memory API #10808

Closed

github-actions bot added the inactive-30d label Jun 12, 2022

github-actions bot removed the inactive-30d label Jun 14, 2022

NVnavkumar mentioned this issue Jun 21, 2022

[FEA] Validate the size/complexity of regular expressions NVIDIA/spark-rapids#4061

Closed

GregoryKimball closed this as completed Jun 28, 2022

davidwendt reopened this Jun 28, 2022

GregoryKimball removed the Needs Triage label Jun 29, 2022

davidwendt self-assigned this Jul 15, 2022

github-actions bot added the inactive-30d label Aug 14, 2022

davidwendt mentioned this issue Oct 14, 2022

Add regex_program class for use with all regex APIs #11927

Merged

vyasr removed the inactive-30d label Feb 23, 2024
