optimize queries where lhs and rhs of predicate are equal #10444

Merged
merged 9 commits into from
Mar 28, 2023

jadami10
Contributor

@jadami10 jadami10 commented Mar 19, 2023

This is a minor performance bugfix and closes #10383

  1. This fixes NullPointerExceptions in existing optimizers when performing WHERE 1=1 queries. These would fail because the filter expression had no function call.
  2. I noticed that WHERE 1=1 was not simplified, but WHERE col1>0 AND 1=1 was actually being simplified in the NumericalFilterOptimizer. So I put that part in a separate class to be used more generally for future cases like this.
    • it does a little more work than expected once it sees an AND/OR/NOT expression
    • something else is converting 1=1 to literal TRUE, but I'm not sure where that is
  3. This adds an IdenticalPredicateFilterOptimizer class that converts WHERE 1=1 or WHERE "colA"!="colA" to TRUE/FALSE respectively.
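
For readers unfamiliar with the idea, here is a minimal sketch of the identical-predicate rewrite described in item 3. It is not the code from this PR: `Expr`, `IdenticalPredicateSketch`, and the operator names `EQUALS` / `NOT_EQUALS` as plain strings are hypothetical stand-ins for Pinot's own expression types, and NULL semantics are ignored.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Hypothetical, simplified expression node. This is NOT Pinot's Expression class.
final class Expr {
  final String operator;     // e.g. "EQUALS" or "NOT_EQUALS"; null for a leaf
  final String value;        // identifier or literal text; null for a function call
  final List<Expr> operands; // empty for a leaf

  private Expr(String operator, String value, List<Expr> operands) {
    this.operator = operator;
    this.value = value;
    this.operands = operands;
  }

  static Expr leaf(String value) {
    return new Expr(null, value, Collections.<Expr>emptyList());
  }

  static Expr function(String operator, Expr... operands) {
    return new Expr(operator, null, Arrays.asList(operands));
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof Expr)) {
      return false;
    }
    Expr other = (Expr) o;
    return Objects.equals(operator, other.operator) && Objects.equals(value, other.value)
        && operands.equals(other.operands);
  }

  @Override
  public int hashCode() {
    return Objects.hash(operator, value, operands);
  }
}

final class IdenticalPredicateSketch {
  static final Expr TRUE = Expr.leaf("true");
  static final Expr FALSE = Expr.leaf("false");

  // Returns TRUE for `x = x`, FALSE for `x != x`, and the input unchanged otherwise.
  static Expr optimize(Expr filter) {
    // A bare literal has no function call, so there is nothing to inspect.
    if (filter.operator == null || filter.operands.size() != 2) {
      return filter;
    }
    if (!filter.operands.get(0).equals(filter.operands.get(1))) {
      return filter;
    }
    switch (filter.operator) {
      case "EQUALS":
        return TRUE;
      case "NOT_EQUALS":
        return FALSE;
      default:
        return filter;
    }
  }
}
```

The early return for leaf expressions mirrors item 1: a filter with no function call is passed through instead of being dereferenced.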

I've added a bunch more test cases, and I've tested manually in the Quickstart app. This is my first contribution to the query parsing part of the code base, so I don't have a great sense of what test coverage looks like. But I imagine that between unit and integration tests, this should catch any glaring breaks?

@codecov-commenter

codecov-commenter commented Mar 20, 2023

Codecov Report

Merging #10444 (ffd0f02) into master (f136a08) will increase coverage by 7.16%.
The diff coverage is 74.81%.

@@             Coverage Diff              @@
##             master   #10444      +/-   ##
============================================
+ Coverage     63.14%   70.31%   +7.16%     
- Complexity     5914     6112     +198     
============================================
  Files          2045     2072      +27     
  Lines        111202   112021     +819     
  Branches      16963    17068     +105     
============================================
+ Hits          70217    78762    +8545     
+ Misses        35818    27717    -8101     
- Partials       5167     5542     +375     
Flag Coverage Δ
integration1 24.55% <53.33%> (+0.03%) ⬆️
integration2 24.18% <51.85%> (-0.11%) ⬇️
unittests1 67.97% <74.81%> (-0.04%) ⬇️
unittests2 13.98% <0.00%> (?)

Impacted Files Coverage Δ
.../pinot/controller/recommender/io/InputManager.java 93.22% <ø> (+93.22%) ⬆️
...ery/optimizer/filter/NumericalFilterOptimizer.java 82.39% <57.69%> (+0.37%) ⬆️
.../optimizer/filter/FlattenAndOrFilterOptimizer.java 85.00% <66.66%> (-3.89%) ⬇️
...izer/filter/IdenticalPredicateFilterOptimizer.java 70.00% <70.00%> (ø)
...imizer/filter/BaseAndOrBooleanFilterOptimizer.java 82.50% <82.50%> (ø)
...ery/optimizer/filter/MergeEqInFilterOptimizer.java 89.41% <82.85%> (-3.19%) ⬇️
...che/pinot/core/query/optimizer/QueryOptimizer.java 100.00% <100.00%> (ø)

... and 312 files with indirect coverage changes

Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Looks good in general, great job!

@@ -170,6 +170,8 @@ private void validateQueries() {
for (String queryString : _queryWeightMap.keySet()) {
try {
PinotQuery pinotQuery = CalciteSqlParser.compileToPinotQuery(queryString);
// TODO: should we catch and ignore any errors here. If we error on query optimization,
Contributor

Good point. Ignoring the error is more robust, while failing the query can help catch bugs in the optimizer and prevent unexpected performance degradation. Currently the optimize logic is applied in place (there is no return value), so I personally prefer failing the query directly, since the query might already be modified and messed up.

Contributor Author

ya, I was thinking the same about the fact that it might be only semi optimized when this fails. I've updated the comment to reflect this state

@@ -44,7 +45,7 @@ public class QueryOptimizer {
// values to the proper format so that they can be properly parsed
private static final List<FilterOptimizer> FILTER_OPTIMIZERS =
Arrays.asList(new FlattenAndOrFilterOptimizer(), new MergeEqInFilterOptimizer(), new NumericalFilterOptimizer(),
- new TimePredicateFilterOptimizer(), new MergeRangeFilterOptimizer());
+ new TimePredicateFilterOptimizer(), new MergeRangeFilterOptimizer(), new IdenticalPredicateFilterOptimizer());
Contributor

Should we apply this optimizer at the end? If it doesn't rely on other optimizers, we can put it right after the flatten optimizer so that the other optimizers don't spend work on identical predicates.

Contributor Author

good call. moving it earlier actually caught another null pointer exception

*/
public abstract class BaseAndOrBooleanFilterOptimizer implements FilterOptimizer {

protected static final Expression TRUE = RequestUtils.getLiteralExpression(true);
Contributor

This file doesn't follow the Pinot Style

Contributor Author

good eye. I was working on a new laptop and hadn't set that up.

protected static final Expression FALSE = RequestUtils.getLiteralExpression(false);

@Override
public abstract Expression optimize(Expression filterExpression, @Nullable Schema schema);
Contributor

(minor) No need to override this API to an abstract method

}

@Override
protected boolean isAlwaysFalse(Expression operand) {
Contributor

Do we need to override this method? After the DFS, all the children should already be optimized

return expression;
}

protected boolean isAlwaysFalse(Expression operand) {
Contributor

If we don't need to override this method (see comment below), we can change optimizeCurrent into a util method

Contributor Author

I've reworked the interface to be a little clearer. The base class handles the DFS, and users just implement the base case. Let me know if this looks better.
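
A rough sketch of the shape described in this reply, where a base class owns the depth-first traversal and AND/OR/NOT folding and a subclass supplies only the base case. Everything here is hypothetical and simplified: `Node`, `BooleanFoldingSketch`, `optimizeChild`, and `IdenticalEqualsSketch` are illustrative names, not the classes or method names merged in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal expression tree; Pinot's real Expression type differs.
final class Node {
  final String op;           // "AND", "OR", "NOT", a predicate name, or null for a leaf
  final String text;         // leaf text; null for function nodes
  final List<Node> children; // mutable so operands can be replaced

  private Node(String op, String text, List<Node> children) {
    this.op = op;
    this.text = text;
    this.children = children;
  }

  static Node leaf(String text) {
    return new Node(null, text, new ArrayList<Node>());
  }

  static Node fn(String op, Node... children) {
    return new Node(op, null, new ArrayList<Node>(Arrays.asList(children)));
  }
}

abstract class BooleanFoldingSketch {
  static final Node TRUE = Node.leaf("true");
  static final Node FALSE = Node.leaf("false");

  // The only method a concrete optimizer implements: rewrite one non-AND/OR/NOT function node.
  protected abstract Node optimizeChild(Node node);

  // Depth-first traversal shared by all subclasses.
  final Node optimize(Node node) {
    if (node.op == null) {
      return node; // leaf, e.g. a filter that is already the literal TRUE
    }
    switch (node.op) {
      case "AND":
      case "OR":
      case "NOT":
        node.children.replaceAll(this::optimize);
        return fold(node);
      default:
        return optimizeChild(node);
    }
  }

  private static boolean isConstant(Node node, Node constant) {
    return node.op == null && constant.text.equals(node.text);
  }

  // Simplifies an AND/OR/NOT node whose children have already been optimized.
  private static Node fold(Node node) {
    List<Node> c = node.children;
    if (node.op.equals("NOT")) {
      return isConstant(c.get(0), TRUE) ? FALSE : isConstant(c.get(0), FALSE) ? TRUE : node;
    }
    boolean and = node.op.equals("AND");
    Node absorbing = and ? FALSE : TRUE; // FALSE decides an AND, TRUE decides an OR
    Node neutral = and ? TRUE : FALSE;   // TRUE is a no-op in an AND, FALSE in an OR
    for (Node child : c) {
      if (isConstant(child, absorbing)) {
        return absorbing;
      }
    }
    c.removeIf(child -> isConstant(child, neutral));
    if (c.isEmpty()) {
      return neutral;
    }
    return c.size() == 1 ? c.get(0) : node;
  }
}

// Example subclass: only the base case is written here.
final class IdenticalEqualsSketch extends BooleanFoldingSketch {
  @Override
  protected Node optimizeChild(Node node) {
    boolean identical = node.op.equals("EQUALS") && node.children.size() == 2
        && node.children.get(0).op == null && node.children.get(1).op == null
        && node.children.get(0).text.equals(node.children.get(1).text);
    return identical ? TRUE : node;
  }
}
```

With this split, `col1 > 0 AND 1 = 1` reduces to the `col1 > 0` node: the subclass turns `1 = 1` into TRUE, and the base class drops the TRUE operand from the AND.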

@jadami10
Contributor Author

> Looks good in general, great job!

thank you! i see all checks passed. let me know if you have further comments, though

Contributor

@Jackie-Jiang Jackie-Jiang left a comment

LGTM otherwise

}

/** Change the expression value to boolean literal with given value. */
protected static void setExpressionToBoolean(Expression expression, boolean value) {
Contributor

Not introduced in this PR, but let's remove this method since we should avoid mutating an expression. We can use the constants TRUE and FALSE instead.

Contributor Author

done. There's still some more mutation in NumericalFilterOptimizer, but this gets rid of a big part

Comment on lines 127 to 133
private boolean isAlwaysFalse(Expression operand) {
return operand.equals(FALSE);
}

private boolean isAlwaysTrue(Expression operand) {
return operand.equals(TRUE);
}
Contributor

(nit) Slightly more readable if we just inline them or rename to isTrue() and isFalse()

Contributor Author

good point. it made more sense in the initial PR

case OR:
case NOT:
// Recursively traverse the expression tree to find an operator node that can be rewritten.
operands.forEach(operand -> optimize(operand, schema));
Contributor

Let's use replaceAll() here so that it still works when optimize is not applied in place

Contributor Author

done
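
To illustrate the difference raised in this thread: `forEach` discards whatever the callback returns, so it only works if the callback mutates its argument, whereas `List.replaceAll` stores the returned value back into the list. The sketch below uses strings and a hypothetical `OPTIMIZE` function in place of real filter expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class ReplaceAllSketch {
  // Hypothetical optimizer that returns a new value rather than mutating its argument.
  static final UnaryOperator<String> OPTIMIZE =
      operand -> operand.equals("1 = 1") ? "TRUE" : operand;

  static List<String> withForEach(List<String> operands) {
    List<String> result = new ArrayList<>(operands);
    result.forEach(operand -> OPTIMIZE.apply(operand)); // return value is discarded
    return result;
  }

  static List<String> withReplaceAll(List<String> operands) {
    List<String> result = new ArrayList<>(operands);
    result.replaceAll(OPTIMIZE); // each element is replaced by the returned value
    return result;
  }
}
```

Given the operands `col1 > 0` and `1 = 1`, `withForEach` returns the list unchanged, while `withReplaceAll` returns `col1 > 0` and `TRUE`.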

switch (kind) {
case EQUALS:
if (hasIdenticalLhsAndRhs(filterExpression)) {
setExpressionToBoolean(filterExpression, true);
Contributor

Directly return TRUE, same for the following

Contributor Author

yup, even better

FilterKind kind = FilterKind.valueOf(function.getOperator());
switch (kind) {
case EQUALS:
if (hasIdenticalLhsAndRhs(filterExpression)) {
Contributor

(minor) Directly pass the operands

Contributor Author

ty, no need to recompute it all

return false;
}
List<Expression> children = function.getOperands();
boolean hasTwoChildren = children.size() == 2;
Contributor

(minor) Is there any case where EQ or NEQ can have other than 2 children? We can probably make it a precondition

Contributor Author

a precondition would fail the query, no? even if it is possible, this function is really only supposed to optimize this one case.


/**
* Pinot queries of the WHERE 1 != 1 AND "col1" = "col2" variety are rewritten as
* 1-1 != 0 AND "col1"-"col2" = 0. Therefore, we check specifically for the case where
Contributor

The rewrite is already happening in PredicateComparisonRewriter.updateFunctionExpression(), so we might just compare the lhs and rhs there.

Since we already have this implementation, we can add a TODO here and revisit later

Contributor Author

I didn't see that class until now. But even then, I think I slightly prefer this more as an optimization than a rewrite. But it's probably easier to do there before it gets converted just for us to convert it back.
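
For context on the Javadoc fragment in this thread: once `"col1" = "col2"` has been rewritten to `"col1" - "col2" = 0`, an identical-operand check has to look inside the subtraction. The following is only a guess at what such a check might look like. `Term` and the `MINUS` operator name are hypothetical, and although the method borrows the name `hasIdenticalLhsAndRhs` from the diff, its body is not taken from the PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical expression type for illustration only; not Pinot's Expression.
final class Term {
  final String op;       // "EQUALS", "NOT_EQUALS", "MINUS", or null for a leaf
  final String text;     // identifier or literal text for a leaf; null otherwise
  final List<Term> args; // empty for a leaf

  private Term(String op, String text, List<Term> args) {
    this.op = op;
    this.text = text;
    this.args = args;
  }

  static Term leaf(String text) {
    return new Term(null, text, Collections.<Term>emptyList());
  }

  static Term fn(String op, Term... args) {
    return new Term(op, null, Arrays.asList(args));
  }

  // Structural equality: same leaf text, or same operator with pairwise-equal arguments.
  boolean sameAs(Term other) {
    if (op == null || other.op == null) {
      return op == null && other.op == null && text.equals(other.text);
    }
    if (!op.equals(other.op) || args.size() != other.args.size()) {
      return false;
    }
    for (int i = 0; i < args.size(); i++) {
      if (!args.get(i).sameAs(other.args.get(i))) {
        return false;
      }
    }
    return true;
  }
}

final class RewrittenComparisonSketch {
  // True when the predicate has the rewritten shape `x - x <cmp> 0`.
  static boolean hasIdenticalLhsAndRhs(Term predicate) {
    if (predicate.op == null || predicate.args.size() != 2) {
      return false; // not a binary predicate; leave it alone rather than fail the query
    }
    Term lhs = predicate.args.get(0);
    Term rhs = predicate.args.get(1);
    boolean rhsIsZero = rhs.op == null && rhs.text.equals("0");
    boolean lhsIsMinus = "MINUS".equals(lhs.op) && lhs.args.size() == 2;
    return rhsIsZero && lhsIsMinus && lhs.args.get(0).sameAs(lhs.args.get(1));
  }
}
```

Returning `false` for unexpected shapes, instead of asserting, follows the author's point earlier in the review that a precondition would fail the query.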

Development

Successfully merging this pull request may close these issues.

filter clause with where <literal>=<literal> or <identifier>=<identifier> fail or are slow
3 participants