Implement scalable distributed joins #63
Comments
I'd love to work on this if someone can provide further reading material and/or point me to the relevant area of the code.
Here is some additional information. When I run TPC-H query 5 in the benchmarks, against DataFusion, I see that the physical plan used partitioned joins. For example, I see that both inputs to the join are partitioned on the join keys, and the join mode is
This means that the join can run in parallel because the inputs are partitioned. So partition 1 of the join reads partition 1 of the left and right inputs, and so on. When I run the same query against Ballista, I see.
Here, we see join mode What we need to do is apply the same "partitioned hash join" pattern to Ballista.
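To illustrate the partition-wise pattern described in this comment, here is a minimal sketch, not DataFusion or Ballista code. It assumes both inputs are already partitioned on the join key with the same partition count. The `Row` type and the `join_partition` and `partitioned_join` functions are hypothetical names chosen for this example; join partition `i` reads only `left[i]` and `right[i]`, and the partitions run on separate threads.

```rust
use std::collections::HashMap;
use std::thread;

// Hypothetical row type for illustration: (join key, payload).
type Row = (u64, String);

/// Join one pair of co-partitioned inputs: build a hash table on the
/// left partition, then probe it with the right partition.
fn join_partition(left: &[Row], right: &[Row]) -> Vec<(u64, String, String)> {
    let mut build: HashMap<u64, Vec<&str>> = HashMap::new();
    for (k, v) in left {
        build.entry(*k).or_default().push(v.as_str());
    }
    let mut out = Vec::new();
    for (k, r) in right {
        if let Some(matches) = build.get(k) {
            for l in matches {
                out.push((*k, l.to_string(), r.clone()));
            }
        }
    }
    out
}

/// Run join partition i over left[i] and right[i], one thread per partition.
/// Assumes equal keys were routed to the same partition index on both sides.
fn partitioned_join(
    left: &[Vec<Row>],
    right: &[Vec<Row>],
) -> Vec<Vec<(u64, String, String)>> {
    assert_eq!(left.len(), right.len(), "inputs must have the same partition count");
    thread::scope(|s| {
        let handles: Vec<_> = left
            .iter()
            .zip(right.iter())
            .map(|(l, r)| s.spawn(move || join_partition(l, r)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Each thread holds only its own left partition in the hash table, which is the property that makes the join scale with the number of partitions.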
I created a Google doc to discuss the design, and planned work, in more detail. https://docs.google.com/document/d/1yUnGWsHKYOAxWijDJisEFYU4dIym_GSRSMpwfWjVZq8/edit?usp=sharing
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The main issue limiting scalability in Ballista today is that joins are implemented as hash joins where each partition of the probe side causes the entire left (build) side to be loaded into memory.
Describe the solution you'd like
To make this scalable we need to hash partition left and right inputs so that we can join the left and right partitions in parallel.
There is already work underway in DataFusion to implement this that we can leverage.
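As a rough illustration of the hash partitioning step proposed above, here is a minimal sketch. It is not the DataFusion implementation; `partition_for` and `hash_repartition` are hypothetical helper names. Rows are routed by the hash of the join key modulo the partition count, so applying the same function to the left and right inputs sends equal keys to the same partition index.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a join key to a partition index (hypothetical helper).
/// `DefaultHasher::new()` uses fixed keys, so the mapping is the same
/// for every call within one process.
fn partition_for<K: Hash>(key: &K, num_partitions: usize) -> usize {
    assert!(num_partitions > 0, "need at least one partition");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % num_partitions as u64) as usize
}

/// Split rows of (key, value) into `num_partitions` buckets by key hash.
fn hash_repartition<K: Hash, V>(
    rows: Vec<(K, V)>,
    num_partitions: usize,
) -> Vec<Vec<(K, V)>> {
    let mut parts: Vec<Vec<(K, V)>> =
        (0..num_partitions).map(|_| Vec::new()).collect();
    for (k, v) in rows {
        let p = partition_for(&k, num_partitions);
        parts[p].push((k, v));
    }
    parts
}
```

In a distributed setting the hash function would also need to be stable across processes and machines; this sketch only covers the single-process case.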
Describe alternatives you've considered
None
Additional context
None