
Implement scalable distributed joins #63

Closed · andygrove opened this issue on Apr 25, 2021 · 3 comments · Fixed by #634

@andygrove (Member)

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The main issue limiting scalability in Ballista today is that joins are implemented as hash joins in which every partition of the probe side loads the entire left (build) side into memory.

Describe the solution you'd like

To make this scalable, we need to hash-partition the left and right inputs on the join keys so that the matching left and right partitions can be joined in parallel.

There is already work underway in DataFusion to implement this that we can leverage.
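
To illustrate the idea, here is a toy sketch in plain Rust (made-up row types and partition count, not DataFusion or Ballista code): hash-partitioning means routing each row to a partition number derived from its join key, so that equal keys from the left and right inputs always land in the same partition.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical partition count for the sketch; a real plan might use 24.
const NUM_PARTITIONS: usize = 4;

// Map a join key to a partition number. Using the same hash function on both
// inputs guarantees that equal keys end up in the same partition.
fn partition_for(key: &str) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % NUM_PARTITIONS
}

// Split rows of (join_key, payload) into NUM_PARTITIONS buckets by key hash.
// The same routine is applied independently to the left and right inputs.
fn hash_partition(rows: Vec<(String, String)>) -> Vec<Vec<(String, String)>> {
    let mut parts: Vec<Vec<(String, String)>> = vec![Vec::new(); NUM_PARTITIONS];
    for (key, payload) in rows {
        let p = partition_for(&key);
        parts[p].push((key, payload));
    }
    parts
}
```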

Describe alternatives you've considered
None

Additional context
None

@andygrove added the enhancement and ballista labels on Apr 25, 2021
@boazberman (Contributor)

I'd love to work on this if someone can provide further reading material and/or point me to the relevant area of the code.

@andygrove (Member, Author)

Here is some additional information. When I run TPC-H query 5 from the benchmarks against DataFusion, I see that the physical plan uses partitioned joins.

For example, both inputs to the join are partitioned on the join keys, and the join mode is Partitioned:

HashJoinExec: mode=Partitioned, join_type=Inner, on=[("c_custkey", "o_custkey")]
  RepartitionExec: partitioning=Hash([Column { name: "c_custkey" }], 24)
    ParquetExec: batch_size=8192, limit=None, partitions=[...]
  RepartitionExec: partitioning=Hash([Column { name: "o_custkey" }], 24)
    FilterExec: o_orderdate >= CAST(1994-01-01 AS Date32) AND o_orderdate < CAST(1995-01-01 AS Date32)
      ParquetExec: batch_size=8192, limit=None, partitions=[...]

This means the join can run in parallel because the inputs are partitioned the same way: partition 1 of the join reads partition 1 of the left and right inputs, and so on.
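
To make that concrete, here is a rough sketch in plain Rust (toy row types, std threads standing in for distributed tasks; none of this is the actual DataFusion/Ballista operator). Each task builds a hash table over its own left partition only and probes it with the matching right partition, assuming both inputs have already been hash-partitioned on the join key as in the sketch above.

```rust
use std::collections::HashMap;
use std::thread;

// Join partition i of the left input with partition i of the right input.
// Inputs are assumed to already be hash-partitioned on the join key.
fn partitioned_hash_join(
    left_parts: Vec<Vec<(String, String)>>,
    right_parts: Vec<Vec<(String, String)>>,
) -> Vec<(String, String, String)> {
    let handles: Vec<_> = left_parts
        .into_iter()
        .zip(right_parts)
        .map(|(left, right)| {
            thread::spawn(move || {
                // Build side: hash table over this left partition only.
                let mut build: HashMap<String, Vec<String>> = HashMap::new();
                for (key, value) in left {
                    build.entry(key).or_default().push(value);
                }
                // Probe side: stream this right partition through the table.
                let mut out = Vec::new();
                for (key, right_value) in right {
                    if let Some(left_values) = build.get(&key) {
                        for left_value in left_values {
                            out.push((key.clone(), left_value.clone(), right_value.clone()));
                        }
                    }
                }
                out
            })
        })
        .collect();

    handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
}
```

Each partition's build table holds only its own slice of the left side, and the per-partition joins are independent, which is what lets the work scale out across executors.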

When I run the same query against Ballista, I see:

HashJoinExec: mode=CollectLeft, join_type=Inner, on=[("c_custkey", "o_custkey")]
  ParquetExec: batch_size=8192, limit=None, partitions=[...]
  FilterExec: o_orderdate >= CAST(1994-01-01 AS Date32) AND o_orderdate < CAST(1995-01-01 AS Date32)
    ParquetExec: batch_size=8192, limit=None, partitions=[...]

Here we see join mode CollectLeft, which means that each partition being executed fetches the entire left side of the join into memory. This is very inefficient in both memory and compute: the full left side is collected and hashed once per probe partition, so the redundant work grows with the number of partitions.
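
For contrast, here is a toy sketch of what CollectLeft amounts to (same made-up row types as the sketches above, not the real operator): every probe partition collects and hashes the full left side before probing.

```rust
use std::collections::HashMap;

// CollectLeft strategy, sketched: the ENTIRE left input is collected and
// hashed once per probe partition, so the build work is repeated N times
// instead of being split across the N partitions.
fn collect_left_join(
    left_all: Vec<(String, String)>,          // full left side, not partitioned
    right_parts: Vec<Vec<(String, String)>>,  // probe-side partitions
) -> Vec<(String, String, String)> {
    let mut out = Vec::new();
    for right in right_parts {
        // Rebuilt from the full left side for every probe partition --
        // this repeated collect/hash step is the scalability problem.
        let mut build: HashMap<&str, Vec<&str>> = HashMap::new();
        for (key, value) in &left_all {
            build.entry(key.as_str()).or_default().push(value.as_str());
        }
        for (key, right_value) in right {
            if let Some(left_values) = build.get(key.as_str()) {
                for left_value in left_values {
                    out.push((key.clone(), left_value.to_string(), right_value.clone()));
                }
            }
        }
    }
    out
}
```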

What we need to do is apply the same "partitioned hash join" pattern to Ballista.

@andygrove (Member, Author)

I created a Google doc to discuss the design and planned work in more detail:

https://docs.google.com/document/d/1yUnGWsHKYOAxWijDJisEFYU4dIym_GSRSMpwfWjVZq8/edit?usp=sharing

@andygrove self-assigned this on Jun 27, 2021
Ted-Jiang referenced this issue in Ted-Jiang/arrow-datafusion Jul 26, 2022
alamb pushed a commit that referenced this issue Jul 27, 2022
alamb pushed a commit that referenced this issue Nov 18, 2024