Subsequent JOINs don't match to correct row
#6744
Comments
Thank you for the report @MidasLamb -- is it possible to share a reproducer (aka the data referred to above)? This seems like it may be data specific |
@alamb , I've created a reproduction repo here: https://github.com/MidasLamb/datafusion-multiple-join-bug-example If you search for "b93" in the output that is shown, you'll see in the first test that when the |
It seems to be a bug in one of the default optimizer rules, because adding this:
let ctx = SessionContext::with_state(
    state
        .with_optimizer_rules(vec![])
);
gives back the results that I expect.
Awesome - @MidasLamb . If possible, could you identify:
|
@Dandandan , I'm trying to find the default optimizer rules so I can start weeding them out, but I can't find them immediately, so that's why I already posted that little update |
No worries, this is the entire list:
|
Seems to be caused by these two rules; commenting out either one of them makes it work as expected:
state.with_optimizer_rules(vec![
// Arc::new(SimplifyExpressions::new()),
// Arc::new(UnwrapCastInComparison::new()),
// Arc::new(ReplaceDistinctWithAggregate::new()),
// Arc::new(EliminateJoin::new()),
// Arc::new(DecorrelatePredicateSubquery::new()),
// Arc::new(ScalarSubqueryToJoin::new()),
Arc::new(ExtractEquijoinPredicate::new()),
// Arc::new(SimplifyExpressions::new()),
// Arc::new(MergeProjection::new()),
// Arc::new(RewriteDisjunctivePredicate::new()),
// Arc::new(EliminateDuplicatedExpr::new()),
// Arc::new(EliminateFilter::new()),
Arc::new(EliminateCrossJoin::new()),
// Arc::new(CommonSubexprEliminate::new()),
// Arc::new(EliminateLimit::new()),
//
// Arc::new(PropagateEmptyRelation::new()),
// Arc::new(FilterNullJoinKeys::default()),
// Arc::new(EliminateOuterJoin::new()),
// Arc::new(PushDownLimit::new()),
// Arc::new(PushDownFilter::new()),
// Arc::new(SingleDistinctToGroupBy::new()),
// Arc::new(SimplifyExpressions::new()),
// Arc::new(UnwrapCastInComparison::new()),
// Arc::new(CommonSubexprEliminate::new()),
// Arc::new(PushDownProjection::new()),
// Arc::new(EliminateProjection::new()),
// Arc::new(PushDownLimit::new()),
]),
Individual plans
Plan without bug:
Plan with bug:
|
It seems to replace an |
I've found this piece of comment/code in the rule's source:
// The filter of inner join will lost, skip this rule.
// issue: https://github.com/apache/arrow-datafusion/issues/4844
if join.filter.is_some() {
    return Ok(None);
}
However, in my scenario the filtered join is not at the top level, so that check seems to get skipped. I've set up a test for this as follows:
let t1 = test_table_scan_with_name("t1")?;
let t2 = test_table_scan_with_name("t2")?;
let t3 = test_table_scan_with_name("t3")?;
let plan = LogicalPlanBuilder::from(t1)
    .join(
        t2,
        JoinType::Inner,
        (Vec::<Column>::new(), Vec::<Column>::new()),
        Some(col("t1.a").eq(col("t2.a"))),
    )?
    .join(
        t3,
        JoinType::Inner,
        (
            vec![Column::from_qualified_name("t2.a")],
            vec![Column::from_qualified_name("t3.a")],
        ),
        None,
    )?
    .filter(col("t2.c").lt(lit(20u32)))?
    .build()?;
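To make the "not the top level" point concrete, here is a toy model in plain Rust. It does not use DataFusion's types: `Plan`, `top_join_has_filter`, `any_join_has_filter` and `nested_filtered_join` are hypothetical names, and the tree only mimics the shape of the test plan above (a filtered join of t1 and t2, nested under an unfiltered join with t3).

```rust
/// Hypothetical, simplified stand-in for a logical plan node.
enum Plan {
    /// A table scan (t1, t2 or t3 in the test plan).
    Scan,
    InnerJoin {
        left: Box<Plan>,
        right: Box<Plan>,
        /// True when the join carries a non-equijoin filter expression.
        has_filter: bool,
    },
}

/// Mirrors a check that only inspects the join at the top of the plan.
fn top_join_has_filter(plan: &Plan) -> bool {
    match plan {
        Plan::InnerJoin { has_filter, .. } => *has_filter,
        Plan::Scan => false,
    }
}

/// Walks the whole tree, so nested filtered joins are found too.
fn any_join_has_filter(plan: &Plan) -> bool {
    match plan {
        Plan::Scan => false,
        Plan::InnerJoin { left, right, has_filter } => {
            *has_filter || any_join_has_filter(left) || any_join_has_filter(right)
        }
    }
}

/// Same shape as the test plan: (t1 JOIN t2 with filter) JOIN t3 on keys.
fn nested_filtered_join() -> Plan {
    let inner = Plan::InnerJoin {
        left: Box::new(Plan::Scan),  // t1
        right: Box::new(Plan::Scan), // t2
        has_filter: true,            // t1.a = t2.a given as a filter
    };
    Plan::InnerJoin {
        left: Box::new(inner),
        right: Box::new(Plan::Scan), // t3
        has_filter: false,           // joined on t2.a = t3.a keys only
    }
}
```

For this plan the top-level check reports no filter while the recursive walk finds one, which is the situation where a guard placed only on the outermost join would not fire.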
Describe the bug
I have a query where I JOIN three tables and then have a WHERE clause on the last table ("twintag.owner"). If I give a value that exists there, all the results are returned (regardless of whether "twintag.owner" matches the actual value), and if I give a value that doesn't exist, it returns nothing.
The query is as follows:
If I start from the intermediate table it works as expected and I get back the one result I'm looking for:
To Reproduce
Create some tables where you can construct a JOIN from table A to B and from B to C.
Execute the (type of) query, where you use the intermediate table to join from table A to C, filtering on something in table C:
Expected behavior
I expect only the items which match the filter on table C to be returned.
Currently it returns either ALL items if an item matches, or NO items if there is no match found. In the example above I expect 1 match, but I get all the items from table A back.
If I change the value I'm looking for in table C to one that doesn't exist there, I get back no results instead, which is expected.
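The "all or nothing" symptom is what one would see if the join condition between table A and table B were dropped while the rest of the query stayed intact. The sketch below shows this in plain Rust with nested loops. The schema (A with an id, B with `a_id` and `c_id`, C with an id and an owner), the data and the function name `matching_a_ids` are all made up for illustration and are not taken from the reproduction repo.

```rust
/// Joins A -> B -> C and filters on C's owner, returning the matching A ids.
/// When `keep_a_b_predicate` is false, the A-to-B join condition is ignored,
/// which turns that join into a cross join.
fn matching_a_ids(
    a: &[u32],
    b: &[(u32, u32)],  // (a_id, c_id)
    c: &[(u32, &str)], // (id, owner)
    owner: &str,
    keep_a_b_predicate: bool,
) -> Vec<u32> {
    let mut out = Vec::new();
    for &a_id in a {
        for &(b_a_id, b_c_id) in b {
            // The A-to-B join condition; skipped when the predicate is "lost".
            if keep_a_b_predicate && b_a_id != a_id {
                continue;
            }
            for &(c_id, c_owner) in c {
                // B-to-C join condition plus the WHERE clause on table C.
                if c_id == b_c_id && c_owner == owner {
                    out.push(a_id);
                }
            }
        }
    }
    out
}
```

With A = [1, 2, 3], B = [(1, 10), (2, 20), (3, 30)] and C = [(10, "alice"), (20, "bob"), (30, "carol")], filtering on "bob" with the predicate kept yields only id 2. With the predicate dropped it yields every id in A, and an owner that does not exist yields nothing in either case.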
Additional context
I'm also using datafusion-remote-table from seafowl (https://github.com/splitgraph/seafowl/tree/main/datafusion_remote_tables)