-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQL: Fix ORDER BY YEAR() function #51562
Conversation
Pinging @elastic/es-search (:Search/SQL) |
Previously, if `YEAR()` was used as and ORDER BY argument without being wrapped with another scalar (e.g. `YEAR(birth_date) + 10`), no script ordering was used but instead the underlying field (e.g. `birth_date`) was used instead as a performance optimisation. This works correctly if YEAR() is the only ORDER BY arg but if further args are used as tie breakers for the ordering wrong results are produced. This is because 2 rows with the different `birth_date but on the same year are not tied as the underlying ordering is on `birth_date` and not on the `YEAR(birth_date)`, and the following ORDER BY args are ignored. Remove this optimisation for `YEAR()` to avoid incorrect results in such cases. Fixes: elastic#51224
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the tests has wrong results.
Chirstian |Koblick | ||
null |Chappelet | ||
Zvonko |Nyanchama | ||
Parto |Bamford |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is not relevant (anymore). Can you add dep.from_date
to the projections?
Why did the order change since the PR should address ONLY the case of ordering by year and something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order can change because previously we would have a strict order on the whole from_date
(including day and month) but now simply on the YEAR, so (apart from the YEAR) can be affected by the order of the doc entries when inserted to the index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so how does the query dsl look like now? Previously I think it contained a "sort" : [{"from_date" : {"order" : "asc"}}]
section. We don't have tests that check the generated ES query with a sort by a date field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, since a scalar is used, it needs to be translated into a scripted sorting, which currently is not supported for nested fields.
Chirstian |Koblick |1986 | ||
null |Chappelet |1988 | ||
Zvonko |Nyanchama |1989 | ||
Parto |Bamford |1995 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ordering here is completely wrong. 1995, 1986, 1996.... The query has ORDER BY start
where start
is YEAR(dep.from_date)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooops! I rushed here, sorry, need to check it. Thx for catching!
Seems that with the fix, YEAR() becomes always a script, and we cannot have scripted sorting on nested fields, It's a current restriction in ES. My suggestion is to catch this in the Verifier and not allow to use scalars on nested fields. |
As discussed, added a detection to prevent usage of scalars on nested fields in ORDER BY. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
// used if the function is monotonic and thus does not have to be computed for ordering purposes | ||
// null means the script needs to be used; expression means the field/expression to be used instead | ||
// Every function needs to be translated to a script | ||
// which will be used for the ordering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the method can be removed entirely - is it used anywhere else?
List<FieldAttribute> nested = new ArrayList<>(); | ||
Consumer<FieldAttribute> match = fa -> { | ||
Consumer<FieldAttribute> match = (fa) -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why (fa)
?
if (fa.isNested()) { | ||
nested.add(fa); | ||
} | ||
}; | ||
|
||
// nested fields shouldn't be used in aggregates or having (yet) | ||
p.forEachDown(a -> a.groupings().forEach(agg -> agg.forEachUp(match, FieldAttribute.class)), Aggregate.class); | ||
p.forEachDown(a -> a.groupings().forEach(agg -> agg.forEachUp(e -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nit - this type of call seems to be used 3 times (last time twice) - maybe could be extracted into a small method to avoid repetition.
Maybe check if it's used anywhere else so even more of a reuse...
Please add also a section in the docs mentioning that scalar functions cannot be used in ordering on nested fields. Maybe even filtering... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
But please, add a test to QueryTranslatorTests for ORDER BY YEAR() without GROUPing BY YEAR(). This PR proves that a small enough change (the overriden orderBy() method removal in Year) can change significantly the generated ES query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left few comments.
|
||
[source, sql] | ||
-------------------------------------------------- | ||
SELECT * FROM test_emp WHERE length(dep.dep_name.keyword) > 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capital letters for length
?
if (fa.isNested()) { | ||
nested.add(fa); | ||
} | ||
}; | ||
Consumer<Expression> checkForNested = e -> | ||
attributeRefs.getOrDefault(e, e).forEachUp(matchNested, FieldAttribute.class); | ||
Consumer<ScalarFunction> checkForNestedInFcuntion = f -> f.arguments().forEach( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in the variable name: checkForNestedInFcuntion
.
if (sf.orderBy() != null) { | ||
Expression orderBy = sf.orderBy(); | ||
if (orderBy instanceof NamedExpression) { | ||
orderBy = qContainer.aliases().getOrDefault(orderBy, orderBy); | ||
qContainer = qContainer | ||
.addSort(new AttributeSort(((NamedExpression) orderBy).toAttribute(), direction, missing)); | ||
} else if (orderBy.foldable() == false) { | ||
// ignore constant | ||
throw new PlanningException("does not know how to order by expression {}", orderBy); | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this logic is removed. We removed the faulty support for sorting on nested fields that needs scripting, but why is this block of code associated with nested fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was especially for Scalar Functions in case the orderBy()
didn't return null -> no script creation, instead use the underlying field fo the ordering. The only use case of this was the YEAR()
function.
Previously, if YEAR() was used as and ORDER BY argument without being wrapped with another scalar (e.g. YEAR(birth_date) + 10), no script ordering was used but instead the underlying field (e.g. birth_date) was used instead as a performance optimisation. This works correctly if YEAR() is the only ORDER BY arg but if further args are used as tie breakers for the ordering wrong results are produced. This is because 2 rows with the different birth_date but on the same year are not tied as the underlying ordering is on birth_date and not on the YEAR(birth_date), and the following ORDER BY args are ignored. Remove this optimisation for YEAR() to avoid incorrect results in such cases. As a consequence another bug is revealed: scalar functions on top of nested fields produce scripted sorting/filtering which is not yet supported. In such cases no error was thrown but instead all values for such nested fields were null and were passed to the script implementing the sorting/filtering, producing incorrect results. Detect such cases and throw a validation exception. Fixes: #51224 (cherry picked from commit f41efd6)
Previously, if YEAR() was used as and ORDER BY argument without being wrapped with another scalar (e.g. YEAR(birth_date) + 10), no script ordering was used but instead the underlying field (e.g. birth_date) was used instead as a performance optimisation. This works correctly if YEAR() is the only ORDER BY arg but if further args are used as tie breakers for the ordering wrong results are produced. This is because 2 rows with the different birth_date but on the same year are not tied as the underlying ordering is on birth_date and not on the YEAR(birth_date), and the following ORDER BY args are ignored. Remove this optimisation for YEAR() to avoid incorrect results in such cases. As a consequence another bug is revealed: scalar functions on top of nested fields produce scripted sorting/filtering which is not yet supported. In such cases no error was thrown but instead all values for such nested fields were null and were passed to the script implementing the sorting/filtering, producing incorrect results. Detect such cases and throw a validation exception. Fixes: #51224 (cherry picked from commit f41efd6)
Previously, if YEAR() was used as and ORDER BY argument without being wrapped with another scalar (e.g. YEAR(birth_date) + 10), no script ordering was used but instead the underlying field (e.g. birth_date) was used instead as a performance optimisation. This works correctly if YEAR() is the only ORDER BY arg but if further args are used as tie breakers for the ordering wrong results are produced. This is because 2 rows with the different birth_date but on the same year are not tied as the underlying ordering is on birth_date and not on the YEAR(birth_date), and the following ORDER BY args are ignored. Remove this optimisation for YEAR() to avoid incorrect results in such cases. As a consequence another bug is revealed: scalar functions on top of nested fields produce scripted sorting/filtering which is not yet supported. In such cases no error was thrown but instead all values for such nested fields were null and were passed to the script implementing the sorting/filtering, producing incorrect results. Detect such cases and throw a validation exception. Fixes: #51224 (cherry picked from commit f41efd6)
Previously, if
YEAR()
was used as and ORDER BY argument without beingwrapped with another scalar (e.g.
YEAR(birth_date) + 10
), no scriptordering was used but instead the underlying field (e.g.
birth_date
)was used instead as a performance optimisation. This works correctly if
YEAR() is the only ORDER BY arg but if further args are used as tie
breakers for the ordering wrong results are produced. This is because
2 rows with the different
birth_date
but on the same year are not tiedas the underlying ordering is on
birth_date
and not on theYEAR(birth_date)
, and the following ORDER BY args are ignored.Remove this optimisation for
YEAR()
to avoid incorrect results insuch cases.
As a consequence another bug is revealed: scalar functions on top
of nested fields produce scripted sorting/filtering which is not yet
supported. In such cases no error was thrown but instead all values for
such nested fields were
null
and were passed to the script implementingthe sorting/filtering, producing incorrect results.
Detect such cases and throw a validation exception.
Fixes: #51224