Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support inlist in LiteralGurantee for pruning #8654

Merged
merged 7 commits into from
Dec 28, 2023

Conversation

my-vegetable-has-exploded
Copy link
Contributor

@my-vegetable-has-exploded my-vegetable-has-exploded commented Dec 26, 2023

Which issue does this PR close?

Closes #8436

Rationale for this change

What changes are included in this PR?

  • refactor some codes
  • Adding InList support in LiteralGurantee code
  • Add tests in LiteralGurantee
  • Add a test for Bloom filters in row_groups.rs

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Physical Expressions label Dec 26, 2023
@@ -334,21 +357,33 @@ impl<'a> ColOpLit<'a> {
binary_expr.op(),
binary_expr.right().as_any(),
);

let guarantee = match op {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since operator except = and != would be evaluate in predicate_expr, I put the conversion process here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

let first_term = &terms[0];
if terms.iter().all(|term| {
term.col.name() == first_term.col.name()
&& term.op == first_term.op
&& term.guarantee == Guarantee::In
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a != foo OR a != bar would be filtered out here, So we don't need to check new_values.len() == 1 in aggregate_multi_conjunct. If not, new_values.len() == 1 will confilct with the LiteralGuarantee result from InList who may have multi literals. And I think refactored code is easier to follow for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also double checked that this will verify that first_term.guarantee needs to be In as well ✅

@alamb
Copy link
Contributor

alamb commented Dec 26, 2023

Thank you @my-vegetable-has-exploded -- I plan to review this tomorrow.

cc @waynexia and @haohuaijin

Copy link
Contributor

@haohuaijin haohuaijin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @my-vegetable-has-exploded , LGTM!

datafusion/physical-expr/src/utils/guarantee.rs Outdated Show resolved Hide resolved
datafusion/physical-expr/src/utils/guarantee.rs Outdated Show resolved Hide resolved
datafusion/physical-expr/src/utils/guarantee.rs Outdated Show resolved Hide resolved
@github-actions github-actions bot added the core Core DataFusion crate label Dec 27, 2023
]);
let sql =
"SELECT * FROM tbl WHERE \"String\" IN ('Hello_Not_Exists', 'Hello_Not_Exists2')";
let expr = sql_to_physical_plan(sql).unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This previous test relied on the optimizer who will convert in_list expr with no more 3 elements to and expr. Now we don't need this. logical2physical would convert logical_expr to physical_expr without optimizer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- this change makes sense

BTW I tried to improve these tests to avoid duplication in #8435 (not yet merged). I'll try and consolidate this test too when I get a chance

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this PR carefully -- thank you @my-vegetable-has-exploded (and @haohuaijin for the review). This PR looks very nice 👌

I left some small suggestions but nothing I think is required -- we could make them as a follow on PR as well

]);
let sql =
"SELECT * FROM tbl WHERE \"String\" IN ('Hello_Not_Exists', 'Hello_Not_Exists2')";
let expr = sql_to_physical_plan(sql).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- this change makes sense

BTW I tried to improve these tests to avoid duplication in #8435 (not yet merged). I'll try and consolidate this test too when I get a chance

.as_any()
.downcast_ref::<crate::expressions::InListExpr>()
{
//Only support single-column inlist currently, multi-column inlist is not supported
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice 👍

datafusion/physical-expr/src/utils/guarantee.rs Outdated Show resolved Hide resolved
let first_term = &terms[0];
if terms.iter().all(|term| {
term.col.name() == first_term.col.name()
&& term.op == first_term.op
&& term.guarantee == Guarantee::In
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also double checked that this will verify that first_term.guarantee needs to be In as well ✅

@@ -334,21 +357,33 @@ impl<'a> ColOpLit<'a> {
binary_expr.op(),
binary_expr.right().as_any(),
);

let guarantee = match op {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

datafusion/physical-expr/src/utils/guarantee.rs Outdated Show resolved Hide resolved
col("b")
.in_list(vec![lit(1), lit(2), lit(3)], false)
.and(col("b").in_list(vec![lit(2), lit(3), lit(4)], false)),
vec![in_guarantee("b", [2, 3])],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very cool

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type of simplification might also be valuable to do in the Expr simplifier code too (so that other optimizer passes could see simpler predicates).

However, I am not sure how common such predicates are 🤔

}

#[test]
fn test_inlist_with_disjunction() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also please add (negative) tests for the inlist directly used with OR too ?

For example:

b IN (1, 2, 3) OR b = 2
b IN (1, 2, 3) OR b != 3

I think the code above will work correctly, but it would be nice to verify

col("b")
.in_list(vec![lit(1), lit(2), lit(3)], false)
.and(col("b").eq(lit(4)).or(col("b").eq(lit(5)))),
vec![],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I theory this could be in_guarantee(1,2,3) right? As it can only be true when b is 1,2,3.
However in this case this expression can never be true so maybe it doesn't really matter 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be invalid since intersection between [1,2,3] and [4,5] is empty.

let intersection = new_values
    .into_iter()
    .filter(|new_value| existing.literals.contains(*new_value))
    .collect::<Vec<_>>();
// for an In guarantee, if the intersection is not empty,  we can extend the guarantee
// e.g. `a IN (1,2,3) AND a IN (2,3,4)` is `a IN (2,3)`
// otherwise, we invalidate the guarantee
// e.g. `a IN (1,2,3) AND a IN (4,5,6)` is `a IN ()`, which is invalid
if !intersection.is_empty() {
    existing.literals = intersection.into_iter().cloned().collect();
} else {
    // at least one was not, so invalidate the guarantee
    *entry = None;
}

BTW, I left a comment in #8437. Please cc when you are free.

Copy link
Contributor

@alamb alamb Dec 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be invalid since intersection between [1,2,3] and [4,5] is empty.

Right -- I wonder if in general for cases where the intersection is empty, can we infer that the expression can not be true ever 🤔 (there is no way to represent this today in the guarantee framework, maybe something like

enum Guarantee {
  /// This predicate can not be true
  CanNotBeTrue,
  /// This predicate can only be true if all the `LiteralGurantee`s are valie
  List(Vec<LiteralGurantee>)
}

And then return Gurantee::CanNotBeTrue if we discover that the expression can not be true 🤔

BTW, I left a comment in #8437. Please cc when you are free.

I am not quite sure what you are referring to. #8437 doesn't seem to have had any recent activity that I can see.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can enhance eliminate_filter optimizer rule to eliminate the filters(in where clause) that always not be true and replace by a EmptyRelation.
like we do for where false in
https://github.com/apache/arrow-datafusion/blob/8284371cb5dbeb5d0b1d50c420affb9be86b1599/datafusion/optimizer/src/eliminate_filter.rs#L55-L62

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed this idea in #8687

datafusion/physical-expr/src/utils/guarantee.rs Outdated Show resolved Hide resolved
Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Thanks 👍

datafusion/physical-expr/src/utils/guarantee.rs Outdated Show resolved Hide resolved
@@ -120,7 +114,7 @@ impl LiteralGuarantee {
/// expression is guaranteed to be `null` or `false`.
///
/// # Notes:
/// 1. `expr` must be a boolean expression.
/// 1. `expr` must be a boolean expression or inlist expression.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More specifically, the boolean expr can only be in a form that can be transformed into a ColOpLit

@my-vegetable-has-exploded
Copy link
Contributor Author

Thanks all❤️.

@@ -780,6 +779,21 @@ mod test {
.and(col("b").eq(lit(3)).or(col("b").eq(lit(4)))),
vec![not_in_guarantee("b", [1, 2, 3]), in_guarantee("b", [3, 4])],
);
// b IN (1, 2, 3) OR b = 2
// TODO this should be in_guarantee("b", [1, 2, 3]) but currently we don't support to anylize this kind of disjunction. Only `ColOpLit OR ColOpLit` is supported.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit 1737d49 into apache:main Dec 28, 2023
22 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 28, 2023

Thanks again @my-vegetable-has-exploded

@my-vegetable-has-exploded my-vegetable-has-exploded deleted the in-guarantee branch December 28, 2023 12:31
@domyway
Copy link

domyway commented Dec 28, 2023

if items more than 10, it still cant use bloom filter

@alamb
Copy link
Contributor

alamb commented Dec 28, 2023

if items more than 10, it still cant use bloom filter

In case anyone is following along, the discussion about this is at #8436 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support IN lists with more than three constants in predicates for bloom filters
5 participants