Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Expose size estimation for conditional joins #8918

Closed
jlowe opened this issue Jul 30, 2021 · 1 comment · Fixed by #8928
Closed

[FEA] Expose size estimation for conditional joins #8918

jlowe opened this issue Jul 30, 2021 · 1 comment · Fixed by #8928
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@jlowe
Copy link
Member

jlowe commented Jul 30, 2021

Is your feature request related to a problem? Please describe.
Similar to #8237, Spark needs the ability to estimate join output sizes to avoid GPU out of memory errors. The ability to estimate the number of output rows for a hash join is key to avoiding performing a join that may explode and allows Spark to split up the join in cases where the number of rows would be too big to manifest in one pass.

Describe the solution you'd like
For each conditional join type, an API should be exposed in join.hpp that can be called with the same parameters as the corresponding conditional join API but instead of returning the full gather map results of the join it instead returns the number of output rows from the join (something that I believe is already being computed internally).

Similar to the solution for #8237 it would also be nice to extend the existing conditional join APIs so the output row count can be passed to the conditional join so it does not need to redundantly compute the output rows again when the application has already computed it.

Describe alternatives you've considered
Like #8237 we could try to play games with catching GPU OOM errors and trying to recover, but this causes as many problems as it solves in practice and is difficult to always discern how many rows are being requested since the code triggering the OOM won't necessarily be the conditional join code.

@jlowe jlowe added feature request New feature or request Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS labels Jul 30, 2021
@vyasr
Copy link
Contributor

vyasr commented Jul 30, 2021

You are correct, the output size is already being computed internally and can easily be exposed.

@vyasr vyasr added this to the Conditional Joins milestone Jul 30, 2021
@vyasr vyasr self-assigned this Jul 30, 2021
@beckernick beckernick removed the Needs Triage Need team to review and classify label Aug 3, 2021
rapids-bot bot pushed a commit that referenced this issue Aug 13, 2021
Resolves #8918 by providing a new API for getting the output size for conditional joins (except full joins). This PR removes the unnecessary `conditional_join.cuh` header and inlines the logic into the `conditional_join.cu` file where it is used, and adds the new logic into that file as well. The public APIs are now exposed in `conditional_join.hpp`.

Adding the join size calculation also revealed a couple of bugs in the conditional join tests that were hiding a real bug in the conditional join implementation. The main test bug was the use of `std::equal` with the actual result as the first iterator, so if the actual result was empty it was never compared against the expected result (even if it was nonempty). This bug masked a couple of minor errors in the expected outputs encoded in the test. These are now fixed. This bug was also hiding a deeper issue where the AST device code was always using the left row index to pull data for the left hand operand to binary operations, even when the lhs was actually a column from the right table. That bug is now fixed as well.

Contributes to #8145.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - MithunR (https://github.com/mythrocks)

URL: #8928
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants