
[FEA] Expose size estimation for join #8237

Closed · revans2 opened this issue May 13, 2021 · 4 comments

revans2 (Contributor) commented May 13, 2021

Is your feature request related to a problem? Please describe.
We have started to work on joins that are too large to fit in GPU memory. The first step was to get back the gather maps from the join instead of materializing the results directly, so we can output the results in appropriately sized batches. For the most part this is working really well (we are able to run all of the NDS queries except one at scale factor 3000). For query 72, however, it is not great. In a typical run we see a join that explodes: when joining a 600MB table with a 400MB table, the join tries to allocate over 260GB just for the gather map (on average, the row count increases about 2,000x).

We run into issues in part because this allocation request is so large that we end up spilling a lot of data to host memory, or even disk. We could avoid this if we knew ahead of time that the join would explode, because then we could split the input batches up.

Part of the problem is that this is an equi-join combined with an inequality condition, so the 2,000x increase in size is before filtering; after the filter it is only a 200x explosion. The AST work at #8214 will therefore be a huge win for us, but even then the gather map for this join would be 26GB, which is more than fits on a typical 16GB GPU.

Describe the solution you'd like
We would like an API exposed in cudf::hash_join that can return an accurate size estimate, similar to cudf::detail::estimate_join_output_size. After talking to @jrhemstad, he suggested that we might want the output size to be exact instead of the estimate it is now.

Then, also have the join APIs in cudf::hash_join optionally take the output of the size API so they don't have to re-compute it in the common case.

This way we can get the output size of the join ahead of time. Depending on how large it is, we can either go ahead with the join at no performance penalty, or, if the gather map is larger than our memory budget, split the incoming batches into smaller pieces, using the output size of the join to estimate how big each batch should be.
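For illustration, the batch-splitting arithmetic this implies could look like the rough sketch below. This is not a cudf API: the helper name, the byte budget, and the assumption that output size scales linearly with probe rows are all hypothetical.

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>

  // Given the exact join output row count for a probe table of probe_rows rows,
  // suggest how many probe rows to feed per batch under budget_bytes.
  std::size_t suggest_batch_rows(std::size_t join_output_rows,
                                 std::size_t probe_rows,
                                 std::size_t budget_bytes)
  {
    // A join produces two gather maps of 4-byte row indices per output row.
    std::size_t const total_bytes = join_output_rows * 2 * sizeof(std::int32_t);
    if (total_bytes <= budget_bytes) { return probe_rows; }  // no split needed
    // Hypothetical simplification: output scales linearly with probe row count.
    double const fraction = static_cast<double>(budget_bytes) / static_cast<double>(total_bytes);
    return std::max<std::size_t>(1, static_cast<std::size_t>(probe_rows * fraction));
  }

With the numbers above, a 26GB gather map against a hypothetical 4GB budget would split the probe table into roughly seven batches.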

Describe alternatives you've considered
Catch the out-of-memory error when the join cannot allocate the gather map, and do the size estimation ourselves: compute a count for each key, followed by the min, max, and average over those counts, to inform how finely to split each incoming batch.
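As a rough sketch of that fallback, assuming plain Thrust (independent of cudf internals), an int key type, and a non-empty key column:

  #include <cstddef>
  #include <thrust/device_vector.h>
  #include <thrust/extrema.h>
  #include <thrust/iterator/constant_iterator.h>
  #include <thrust/reduce.h>
  #include <thrust/sort.h>

  struct key_stats { std::size_t min; std::size_t max; double avg; };

  // Per-key occurrence counts, then min/max/average over those counts.
  // Takes the keys by value because the sort below is destructive.
  key_stats per_key_count_stats(thrust::device_vector<int> keys)
  {
    thrust::sort(keys.begin(), keys.end());
    thrust::device_vector<int> unique_keys(keys.size());
    thrust::device_vector<std::size_t> counts(keys.size());
    // Segmented reduction: one occurrence count per distinct key.
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(),
                                      thrust::constant_iterator<std::size_t>(1),
                                      unique_keys.begin(), counts.begin());
    auto const n = ends.second - counts.begin();
    auto mm  = thrust::minmax_element(counts.begin(), counts.begin() + n);
    auto sum = thrust::reduce(counts.begin(), counts.begin() + n, std::size_t{0});
    return {static_cast<std::size_t>(*mm.first),
            static_cast<std::size_t>(*mm.second),
            static_cast<double>(sum) / static_cast<double>(n)};
  }

The max count bounds how much a single key can amplify the output, and the average gives a reasonable first guess at how finely to split.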

@revans2 added the feature request, Needs Triage, and Spark labels on May 13, 2021
jrhemstad (Contributor) commented:

What I'd like to see here is 3 new APIs for the hash_join class, one for each type of join, each returning the exact number of matches with the specified probe table, e.g.:

std::size_t inner_join_size(cudf::table_view const& probe,
                            null_equality compare_nulls  = null_equality::EQUAL,
                            rmm::cuda_stream_view stream = rmm::cuda_stream_default) const;

Then for each of the existing join APIs, add a std::optional<std::size_t> parameter to allow the user to specify the output size:

  std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
            std::unique_ptr<rmm::device_uvector<size_type>>>
  inner_join(cudf::table_view const& probe,
             null_equality compare_nulls             = null_equality::EQUAL,
             std::optional<std::size_t> output_size  = {},
             rmm::cuda_stream_view stream            = rmm::cuda_stream_default,
             rmm::mr::device_memory_resource* mr     = rmm::mr::get_current_device_resource()) const;

If output_size.has_value(), we'll use that value for the output size; otherwise, we'll compute it. It is undefined behavior if output_size is smaller than the actual output size.
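For example, a minimal caller of the proposed two-call pattern might look like the sketch below. Signatures follow the proposal above and the budget check is illustrative; the final API may differ.

  #include <cudf/join.hpp>
  #include <cudf/table/table_view.hpp>
  #include <cstddef>
  #include <optional>

  void join_within_budget(cudf::table_view const& build,
                          cudf::table_view const& probe,
                          std::size_t budget_bytes)
  {
    cudf::hash_join joiner(build, cudf::null_equality::EQUAL);

    // Pass 1: compute the exact output row count once.
    std::size_t const out_rows = joiner.inner_join_size(probe);

    // Two gather maps of size_type indices are produced per output row.
    if (out_rows * 2 * sizeof(cudf::size_type) <= budget_bytes) {
      // Pass 2: hand the size back so the join skips recomputing it.
      auto gather_maps = joiner.inner_join(
        probe, cudf::null_equality::EQUAL, std::optional<std::size_t>{out_rows});
      // ... gather rows from build/probe using the two maps ...
    } else {
      // Split the probe table into smaller batches and join each piece.
    }
  }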

Like Bobby said, I want to use this opportunity to just get rid of the size estimation logic altogether. We should always compute an exact value, whether the user is requesting it explicitly via inner_join_size or we're computing it internally.

jrhemstad (Contributor) commented:

CC @magnatelee I recall long ago you were asking for the same thing.

@jlowe added the libcudf label and removed the Needs Triage label on May 13, 2021
rapids-bot pushed a commit that referenced this issue on Jun 14, 2021:
Addresses #8237

This PR adds 3 join size APIs (`hash_join::inner_join_size`, `hash_join::left_join_size` and `hash_join::full_join_size`) to the `hash_join` class, one for each type of join, each returning the exact number of matches for the specified probe table. It also completely removes the deprecated size-estimation logic from the current implementation.

Also, this PR updates the existing join APIs by adding an optional `output_size` as an argument. If `output_size.has_value()`, we take that value directly for further computation. Otherwise, the target join will internally invoke its corresponding size function.

`TODO`: the current `full_join_size` uses a 2-step algorithm similar to what's used in `hash_join::full_join`. It therefore duplicates certain computations with `full_join` and should be refactored during the `cuco` integration.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Robert Maynard (https://github.com/robertmaynard)
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #8453
beckernick (Member) commented:

@PointKernel based on the discussion, it looks like this was fully addressed by #8453. Can this be closed?

PointKernel (Member) commented:

Thanks for the reminder. Let me close this.
