[FEA] Ability to control the amount of temporary memory used for regex expressions #10852
Comments
This issue has been labeled
This is still wanted
@revans2 And what about this one? #10808 (comment)
Closing after discussion in #10808
#10808 was the API that I built, and it turned out not to be useful. This feature describes the actual API that Spark wants.
This issue has been labeled
Adds a new `regex_program` class to encapsulate a regex pattern and the parameters used for executing regex calls on strings columns in libcudf. This provides a single object to hold the regex settings rather than adding or updating parameters on every call.

Given a pattern (and other settings), it will _compile_ and validate the pattern and build the set of instructions/commands needed to execute the regex on a strings column. Converting the pattern is done in CPU code. The object contains no state data and can be reused on the same API or other similar calls as appropriate (per the settings). The object can also be queried to help with resource allocation/expectations.

The main files to review are the new `regex_program*` source files plus the corresponding changes in `regexec.cpp` (renamed from .cu). The remainder are simply side effects and follow common patterns to use the new object.

No function or behavior has changed; rather, a new interface has been added over the existing functionality, and additional tests have been added to exercise it through the companion APIs. Currently, all regex APIs are duplicated: the original API plus a new one accepting a `regex_program` object. Once accepted, we may consider deprecating the non-object APIs and then removing them in a future release.

This will help with changes needed for #10852

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Vyas Ramasubramani (https://github.com/vyasr)
- Bradley Dice (https://github.com/bdice)
- Ray Douglass (https://github.com/raydouglass)

URL: #11927
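The "compile once, reuse across calls" design described in the commit message above can be modeled in a few lines. This is only an illustrative sketch in Python using the standard `re` module, not the libcudf class: the name `RegexProgram` and its `contains`/`count` methods are hypothetical stand-ins chosen for this example.

```python
import re


class RegexProgram:
    """Hypothetical model of a compiled-pattern object (not the libcudf API)."""

    def __init__(self, pattern, flags=0):
        self.pattern = pattern
        self.flags = flags
        # Compile and validate once, up front; an invalid pattern raises re.error here.
        self._compiled = re.compile(pattern, flags)

    def contains(self, strings):
        # True for each string in which the pattern matches anywhere.
        return [self._compiled.search(s) is not None for s in strings]

    def count(self, strings):
        # Number of non-overlapping matches in each string.
        return [sum(1 for _ in self._compiled.finditer(s)) for s in strings]


# One object holds the pattern and settings and is reused across different calls.
prog = RegexProgram(r"\d+")
has_digits = prog.contains(["abc123", "xyz"])
digit_runs = prog.count(["a1b22c333", ""])
```

The object keeps no per-call state, so the same instance can be passed to any operation that accepts it, which is the property the commit message relies on to avoid threading new parameters through every API.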
Is your feature request related to a problem? Please describe.
Regular expression processing can require a significant amount of temporary memory. The RAPIDS Accelerator for Apache Spark needs to control how much GPU memory these operations use, in order to avoid excessive spilling or GPU out-of-memory errors when the user provides a particularly complicated regex pattern and/or large input data.
Describe the solution you'd like
The libcudf regular expression APIs would accept an optional parameter that sets an upper bound on the amount of temporary GPU memory used for regular expression processing. If the value is below the "natural" size needed for full concurrency, the algorithm would reduce its concurrency to fit within the memory bound. I would expect there to be a lower limit below which regex processing is not possible within the requested memory bound.
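The requested behavior can be sketched as a small planning function. This is a minimal model, not libcudf code: it assumes that working memory scales linearly with the number of strings processed concurrently, and the names `plan_batches`, `bytes_per_string`, and `memory_limit` are hypothetical.

```python
def plan_batches(num_strings, bytes_per_string, memory_limit=None):
    """Return (batch_size, num_batches) for a hypothetical memory-bounded regex call.

    Assumes working memory = bytes_per_string * strings processed concurrently.
    """
    if num_strings == 0:
        return (0, 0)

    # "Natural" size: every string processed concurrently in a single pass.
    natural_bytes = num_strings * bytes_per_string
    if memory_limit is None or memory_limit >= natural_bytes:
        return (num_strings, 1)

    # Reduce concurrency until one batch fits within the bound.
    batch_size = memory_limit // bytes_per_string
    if batch_size < 1:
        # Lower limit: not even one string fits in the requested memory.
        raise ValueError(
            f"memory_limit {memory_limit} is below the minimum of {bytes_per_string} bytes"
        )

    num_batches = -(-num_strings // batch_size)  # ceiling division
    return (batch_size, num_batches)
```

With these assumptions, a tighter bound trades throughput for memory: 1000 strings at 64 bytes each run in one pass with no limit, but in four passes of 250 strings under a 16000-byte limit.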
Describe alternatives you've considered
Instead of APIs focused on limiting memory, there could be APIs that report how much memory will be used without any ability to control it, such as the one implemented in #10808. This type of API does not let the caller trade off GPU memory usage against GPU performance: the operation either fits in GPU memory or it does not. If the reported size is too big, the RAPIDS Accelerator would be forced to fall back to the CPU to perform the regex processing (with the requisite transform from columnar to row-formatted data and back).
The RAPIDS Accelerator currently does not support falling back to the CPU after query planning has completed on the Spark driver (which does not have a GPU), and query planning has access only to the regex pattern, not to the string data to be searched. Even with a memory-size reporting API, without the input data the API would have to return a worst-case estimate, which could cause an unnecessary fallback to the CPU.
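To make the planning-time problem concrete, here is a toy calculation of how a report-only API leads to an unnecessary fallback. All numbers and the `choose_device` helper are hypothetical; the sketch assumes a fixed per-string cost and a planner that only knows an upper bound on the row count.

```python
def choose_device(estimated_bytes, gpu_budget_bytes):
    # Report-only API: the decision is binary, either it fits or it falls back.
    return "gpu" if estimated_bytes <= gpu_budget_bytes else "cpu"


bytes_per_string = 64      # hypothetical working memory per string for this pattern
gpu_budget_bytes = 32_000  # hypothetical memory budget for the operation

worst_case_rows = 1_000    # planner's upper bound; the driver cannot see the data
actual_rows = 400          # what the batch turns out to contain at run time

# At planning time only the worst case is available: 64,000 bytes exceeds the budget.
plan_time_choice = choose_device(bytes_per_string * worst_case_rows, gpu_budget_bytes)

# With the real data the operation would have fit: 25,600 bytes is within the budget.
run_time_choice = choose_device(bytes_per_string * actual_rows, gpu_budget_bytes)
```

Here the planner picks the CPU even though the actual batch would have fit on the GPU, which is the unnecessary fallback described above.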