[FEA] Expose size estimation for conditional joins #8918
Labels
feature request
New feature or request
libcudf
Affects libcudf (C++/CUDA) code.
Spark
Functionality that helps Spark RAPIDS
Milestone
Is your feature request related to a problem? Please describe.
Similar to #8237, Spark needs the ability to estimate join output sizes to avoid GPU out of memory errors. The ability to estimate the number of output rows for a hash join is key to avoiding performing a join that may explode and allows Spark to split up the join in cases where the number of rows would be too big to manifest in one pass.
Describe the solution you'd like
For each conditional join type, an API should be exposed in join.hpp that can be called with the same parameters as the corresponding conditional join API but instead of returning the full gather map results of the join it instead returns the number of output rows from the join (something that I believe is already being computed internally).
Similar to the solution for #8237 it would also be nice to extend the existing conditional join APIs so the output row count can be passed to the conditional join so it does not need to redundantly compute the output rows again when the application has already computed it.
Describe alternatives you've considered
Like #8237 we could try to play games with catching GPU OOM errors and trying to recover, but this causes as many problems as it solves in practice and is difficult to always discern how many rows are being requested since the code triggering the OOM won't necessarily be the conditional join code.
The text was updated successfully, but these errors were encountered: