-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Epic] Remove Sort Merge Join Experimental status #9846
Comments
@alamb @ozankabak @viirya @mustafasrepo @berkaysynnada @metesynnada appreciate your inputs. |
From my experience, I have never seen SortMergeJoin used in any plan I looked at in DataFusion, so therefore I think it is still "experimental" or at least "not used by datafusion by default" (which maybe is the same thing) It looks like there was some past interest in SortMergeJoin -- https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+sortmergejoin Also the people interested in that operator seem to be the people focused on Spark |
it is used if next conditions met https://github.com/apache/arrow-datafusion/blob/81c96fc3db0ea35638278f32df066be63b745a51/datafusion/core/src/physical_planner.rs#L1136 There is also a small set of tests introduced in To enforce SMJ its needed to set
Probably we can revisit tests and run some benchmarks with SMJ enforced to make a decision? |
I believe we can add fuzz tests for SMJ to ensure it is robust. |
I'm thinking if its enough to add fuzz tests, prob we also need to run benchmarks on top of SMJ? Afaik now benchmarks are on top of the HJ? |
Is there a rule of thumb for choosing SMJ over HJ? |
I wonder how SMJ in DataFusion compares against HJ at the moment. Some ideas for when SMJ could be chosen over HJ:
|
I believe current state of the art in query processing is
The only benefit SMJ has over HJ at the moment in Datafusion is that we could plausibly join relations that are larger than memory using SMJ (using the fact that we can spill the inputs) -- this may be what @Dandandan is saying in #9846 (comment) I think it is close to impossible to make SMJ beat HJ for raw performance when the relations fit in memory |
we shouldn't be comparing HJ vs SMJ 1:1, but the performance has to be quite close? What I'm trying to solve is to find a strategy to remove the experimental flag from SMJ and prove it is stable. btw I found the fuzz tests are in place https://github.com/apache/arrow-datafusion/blob/daf182dc789230dbd9cf21ca2e975789213a5365/datafusion/core/tests/fuzz_cases/join_fuzz.rs#L128 |
I ran TPCH benchmarks for SMJ and got
|
Seems like a good reason to keep it marked as experimental |
I'll create a separate issue on it. |
Is there any plan on re-iterating on the SMJ heuristics ? This couldn't use SMJ with current heuristics: datafusion/datafusion/core/src/physical_planner.rs Lines 1114 to 1116 in f3b1141
|
In my opinion, we should make more / better knobs for this kind of tuning (to make it easier to choose what types of joins, etc to use). Adjusting heuristics will likely just make some people's plans better but others worse |
Sounds like that would probably be a good choice. Besides adding a knob / changing the default to do this, I think we will need some examples /benchmarks showing preferring SMJ over hash join in certain situations will improve performance and thus is a sane default. |
Until new configuration option is added I believe you could add your own optimiser rule to switch join implementation. |
The SMJ still needs some work to be done before we can enable it like supporting RightSemi/RightAnti. Also the issue was reported with small/empty batches in apache/datafusion-comet#1211 (comment) |
Is your feature request related to a problem or challenge?
Hi all
I was going through SMJ implementation and suddenly stepped on the comments
https://github.com/apache/arrow-datafusion/blob/81c96fc3db0ea35638278f32df066be63b745a51/datafusion/core/src/physical_planner.rs#L1141
I think it would be nice to revisit it and understand if Sort Merge Join Exec is still experimental.
And if so is there any strategies to make it stable, or to run benchmarks to prove the join is stable?
fuzz_cases::join_fuzz::test_anti_join_1k_filtered
#11555Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: