Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] Remove Sort Merge Join Experimental status #9846

Open
15 of 19 tasks
comphead opened this issue Mar 29, 2024 · 17 comments
Open
15 of 19 tasks

[Epic] Remove Sort Merge Join Experimental status #9846

comphead opened this issue Mar 29, 2024 · 17 comments
Labels
enhancement New feature or request

Comments

@comphead
Copy link
Contributor

comphead commented Mar 29, 2024

Is your feature request related to a problem or challenge?

Hi all

I was going through SMJ implementation and suddenly stepped on the comments

// Sort-Merge join support currently is experimental

https://github.com/apache/arrow-datafusion/blob/81c96fc3db0ea35638278f32df066be63b745a51/datafusion/core/src/physical_planner.rs#L1141

I think it would be nice to revisit it and understand if Sort Merge Join Exec is still experimental.
And if so is there any strategies to make it stable, or to run benchmarks to prove the join is stable?

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@comphead comphead added the enhancement New feature or request label Mar 29, 2024
@comphead
Copy link
Contributor Author

@alamb @ozankabak @viirya @mustafasrepo @berkaysynnada @metesynnada appreciate your inputs.

@alamb
Copy link
Contributor

alamb commented Mar 29, 2024

From my experience, I have never seen SortMergeJoin used in any plan I looked at in DataFusion, so therefore I think it is still "experimental" or at least "not used by datafusion by default" (which maybe is the same thing)

It looks like there was some past interest in SortMergeJoin -- https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+sortmergejoin

Also the people interested in that operator seem to be the people focused on Spark

@comphead
Copy link
Contributor Author

it is used if next conditions met https://github.com/apache/arrow-datafusion/blob/81c96fc3db0ea35638278f32df066be63b745a51/datafusion/core/src/physical_planner.rs#L1136

There is also a small set of tests introduced in sort_merge_join.slt. And the plans there shows SMJ

To enforce SMJ its needed to set

set datafusion.optimizer.prefer_hash_join = false;

Probably we can revisit tests and run some benchmarks with SMJ enforced to make a decision?

@metesynnada
Copy link
Contributor

I believe we can add fuzz tests for SMJ to ensure it is robust.

@comphead
Copy link
Contributor Author

comphead commented Apr 2, 2024

I'm thinking if its enough to add fuzz tests, prob we also need to run benchmarks on top of SMJ? Afaik now benchmarks are on top of the HJ?

@metesynnada
Copy link
Contributor

Is there a rule of thumb for choosing SMJ over HJ?

@Dandandan
Copy link
Contributor

Is there a rule of thumb for choosing SMJ over HJ?

I wonder how SMJ in DataFusion compares against HJ at the moment.

Some ideas for when SMJ could be chosen over HJ:

  • When input data is already sorted on relevant keys, it is likely faster/requires less memory to plan a SMJ than HJ.
  • HJ might require more memory than SMJ, so whenever e.g. data skew is expected one might choose sort merge over hash join.

@alamb
Copy link
Contributor

alamb commented Apr 3, 2024

Is there a rule of thumb for choosing SMJ over HJ?

I believe current state of the art in query processing is

  1. If the data is already sorted by join keys, use MergeJoin (as @Dandandan says)
  2. If the data is not already sorted on join key, use HashJoin
  3. If HashJoin runs out of memory building the hash table, spill the table to disk (possibly switching to merge join internally)

The only benefit SMJ has over HJ at the moment in Datafusion is that we could plausibly join relations that are larger than memory using SMJ (using the fact that we can spill the inputs) -- this may be what @Dandandan is saying in #9846 (comment)

I think it is close to impossible to make SMJ beat HJ for raw performance when the relations fit in memory

@comphead
Copy link
Contributor Author

comphead commented Apr 3, 2024

we shouldn't be comparing HJ vs SMJ 1:1, but the performance has to be quite close? What I'm trying to solve is to find a strategy to remove the experimental flag from SMJ and prove it is stable.

btw I found the fuzz tests are in place https://github.com/apache/arrow-datafusion/blob/daf182dc789230dbd9cf21ca2e975789213a5365/datafusion/core/tests/fuzz_cases/join_fuzz.rs#L128

@comphead
Copy link
Contributor Author

I ran TPCH benchmarks for SMJ and got

thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/joins/sort_merge_join.rs:1357:22:
index out of bounds: the len is 0 but the index is 1
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/joins/sort_merge_join.rs:1357:22:
index out of bounds: the len is 0 but the index is 1
thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/joins/sort_merge_join.rs:1357:22:
index out of bounds: the len is 0 but the index is 1
thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/joins/sort_merge_join.rs:1357:22:
index out of bounds: the len is 0 but the index is 1
Error: Context("Join Error", External(JoinError::Panic(Id(88693), ...)))

@alamb
Copy link
Contributor

alamb commented Apr 16, 2024

Seems like a good reason to keep it marked as experimental

@comphead
Copy link
Contributor Author

Seems like a good reason to keep it marked as experimental

I'll create a separate issue on it.
Once TPCH passed we can get back on SMJ status

@16pierre
Copy link

Is there a rule of thumb for choosing SMJ over HJ?

Is there any plan on re-iterating on the SMJ heuristics ?
For example, a use-case of mine would be: input data is pre-sorted + target_partitions = 1 to use a single CPU core (wonder if there are better knobs that target_partitions for this btw).

This couldn't use SMJ with current heuristics:

} else if session_state.config().target_partitions() > 1
&& session_state.config().repartition_joins()
&& !prefer_hash_join

@alamb
Copy link
Contributor

alamb commented Jan 5, 2025

This couldn't use SMJ with current heuristics:

In my opinion, we should make more / better knobs for this kind of tuning (to make it easier to choose what types of joins, etc to use). Adjusting heuristics will likely just make some people's plans better but others worse

@Dandandan
Copy link
Contributor

Is there a rule of thumb for choosing SMJ over HJ?

Is there any plan on re-iterating on the SMJ heuristics ? For example, a use-case of mine would be: input data is pre-sorted + target_partitions = 1 to use a single CPU core (wonder if there are better knobs that target_partitions for this btw).

This couldn't use SMJ with current heuristics:

} else if session_state.config().target_partitions() > 1
&& session_state.config().repartition_joins()
&& !prefer_hash_join

Sounds like that would probably be a good choice. Besides adding a knob / changing the default to do this, I think we will need some examples /benchmarks showing preferring SMJ over hash join in certain situations will improve performance and thus is a sane default.

@milenkovicm
Copy link
Contributor

Is there a rule of thumb for choosing SMJ over HJ?

Is there any plan on re-iterating on the SMJ heuristics ?

For example, a use-case of mine would be: input data is pre-sorted + target_partitions = 1 to use a single CPU core (wonder if there are better knobs that target_partitions for this btw).

This couldn't use SMJ with current heuristics:

} else if session_state.config().target_partitions() > 1
&& session_state.config().repartition_joins()
&& !prefer_hash_join

Until new configuration option is added I believe you could add your own optimiser rule to switch join implementation.

@comphead
Copy link
Contributor Author

comphead commented Jan 5, 2025

The SMJ still needs some work to be done before we can enable it like supporting RightSemi/RightAnti. Also the issue was reported with small/empty batches in apache/datafusion-comet#1211 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants