[FEA] Improved diagnostics for regex kernel launch failures due to insufficient reserved memory #4511

jlowe · 2022-01-12T18:01:26Z

Is your feature request related to a problem? Please describe.
Launching a regular expression kernel can require a significant amount of thread stack space. Unfortunately the default plugin setup allocates almost all of the GPU memory for the RMM pool, and unless the CUDA async allocator backs the RMM pool, the driver has only the memory remaining outside the pool to use for launching kernels. If that remaining memory cannot satisfy the kernel launch, an OOM error is thrown. This error can be very confusing to users, since they may think they need to increase the size of the RMM pool when the fix is actually to decrease it.

Describe the solution you'd like
Ideally we would switch to the async allocator. If we cannot do that in a timely manner, it would be nice to have the regex code catch OOM errors thrown when launching the kernels and wrap them in a new error. The new message should indicate that the failure may be a kernel launch failure caused by insufficient reserved memory rather than insufficient RMM pool memory. (Logging the free memory available in the RMM pool as part of the message would be a nice bonus to help the user determine which cause is more likely.)
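As a rough illustration of the wrapping idea, here is a minimal Java sketch. Everything in it is hypothetical: `RegexOomDiagnostics`, `RegexKernelOomException`, and the `poolFreeBytes` supplier are names invented for this example, and it assumes the native OOM surfaces in Java as `java.lang.OutOfMemoryError`. How the plugin would actually obtain the free pool size is left open.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

public class RegexOomDiagnostics {

    /** Hypothetical exception type; not part of cudf or the plugin. */
    public static class RegexKernelOomException extends RuntimeException {
        public RegexKernelOomException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs a regex operation and, if it fails with an OOM, rethrows with a
     * message that points at reserved memory as a possible cause.
     *
     * Assumption: the OOM arrives as java.lang.OutOfMemoryError.
     * Assumption: poolFreeBytes reports the free bytes in the RMM pool.
     */
    public static <T> T withOomDiagnostics(Supplier<T> regexOp,
                                           LongSupplier poolFreeBytes) {
        try {
            return regexOp.get();
        } catch (OutOfMemoryError oom) {
            long free = poolFreeBytes.getAsLong();
            String message =
                "Out of memory while running a regular expression kernel. "
                + "This may be a kernel launch failure caused by insufficient "
                + "reserved memory outside the RMM pool rather than an "
                + "exhausted pool. Free memory in RMM pool: " + free
                + " bytes. If the pool still has free memory, consider "
                + "reducing the pool size.";
            // Keep the original error as the cause so no detail is lost.
            throw new RegexKernelOomException(message, oom);
        }
    }
}
```

A call site would wrap only the regex invocation, for example `withOomDiagnostics(() -> runRegex(input), freeBytesSupplier)`, where `runRegex` and `freeBytesSupplier` are again placeholders.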

jlowe · 2022-01-13T18:18:16Z

Another potential solution here, which may be far preferable, is to have cudf throw different kinds of exceptions based on the mode of failure, or at least change the message text of the error being thrown based on the situation.

At the C++ level, RMM throws a specific type of exception when the pool runs out of memory. Hopefully this can be distinguished from the exception libcudf throws for a CUDA error resulting from out of memory. The latter is necessarily a driver-level error, which implies insufficient reserved memory rather than an exhausted pool. We could then throw different Java exceptions, or change the message text, based on whether it was an RMM out of memory or a CUDA out of memory.
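On the Java side, that could look roughly like the sketch below. It assumes the native layer can report which failure mode occurred; the `OomSource` enum, both exception classes, and `toJavaException` are hypothetical names for illustration, not existing cudf APIs.

```java
public class OomClassification {

    /** Hypothetical tag for where the native out-of-memory came from. */
    public enum OomSource { RMM_POOL, CUDA_DRIVER }

    /** Hypothetical: the RMM pool itself could not satisfy an allocation. */
    public static class RmmPoolOomException extends RuntimeException {
        public RmmPoolOomException(String message) { super(message); }
    }

    /** Hypothetical: the driver ran out of memory outside the RMM pool. */
    public static class CudaReservedOomException extends RuntimeException {
        public CudaReservedOomException(String message) { super(message); }
    }

    /** Maps the native failure mode to a distinct Java exception and hint. */
    public static RuntimeException toJavaException(OomSource source,
                                                   String nativeMessage) {
        switch (source) {
            case RMM_POOL:
                return new RmmPoolOomException(
                    "RMM pool out of memory: " + nativeMessage
                    + ". Consider a larger pool or smaller batches.");
            case CUDA_DRIVER:
                return new CudaReservedOomException(
                    "CUDA out of memory outside the RMM pool: " + nativeMessage
                    + ". Consider reducing the pool size to leave more "
                    + "reserved memory for kernel launches.");
            default:
                throw new IllegalArgumentException("Unknown source: " + source);
        }
    }
}
```

Separate exception types let callers react programmatically, while the differing message text alone would already tell users which direction to adjust the pool size.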

jlowe added the "feature request" and "? - Needs Triage" labels Jan 12, 2022

jlowe mentioned this issue Jan 12, 2022 in [FEA] Enable regular expressions by default #4509

sameerz added the "task" label and removed the "? - Needs Triage" and "feature request" labels Jan 18, 2022

jlowe closed this as not planned Feb 28, 2024
