Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Fix ORDER BY YEAR() function #51562

Merged
merged 7 commits into from
Jan 30, 2020
Merged

SQL: Fix ORDER BY YEAR() function #51562

merged 7 commits into from
Jan 30, 2020

Conversation

matriv
Copy link
Contributor

@matriv matriv commented Jan 28, 2020

Previously, if YEAR() was used as and ORDER BY argument without being
wrapped with another scalar (e.g. YEAR(birth_date) + 10), no script
ordering was used but instead the underlying field (e.g. birth_date)
was used instead as a performance optimisation. This works correctly if
YEAR() is the only ORDER BY arg but if further args are used as tie
breakers for the ordering wrong results are produced. This is because
2 rows with the different birth_date but on the same year are not tied
as the underlying ordering is on birth_date and not on the
YEAR(birth_date), and the following ORDER BY args are ignored.

Remove this optimisation for YEAR() to avoid incorrect results in
such cases.

As a consequence another bug is revealed: scalar functions on top
of nested fields produce scripted sorting/filtering which is not yet
supported. In such cases no error was thrown but instead all values for
such nested fields were null and were passed to the script implementing
the sorting/filtering, producing incorrect results.

Detect such cases and throw a validation exception.

Fixes: #51224

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/SQL)

Previously, if `YEAR()` was used as and ORDER BY argument without being
wrapped with another scalar (e.g. `YEAR(birth_date) + 10`), no script
ordering was used but instead the underlying field (e.g. `birth_date`)
was used instead as a performance optimisation. This works correctly if
YEAR() is the only ORDER BY arg but if further args are used as tie
breakers for the ordering wrong results are produced. This is because
2 rows with the different `birth_date but on the same year are not tied
as the underlying ordering is on `birth_date` and not on the
`YEAR(birth_date)`, and the following ORDER BY args are ignored.

Remove this optimisation for `YEAR()` to avoid incorrect results in
such cases.

Fixes: elastic#51224
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the tests has wrong results.

Chirstian |Koblick
null |Chappelet
Zvonko |Nyanchama
Parto |Bamford
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is not relevant (anymore). Can you add dep.from_date to the projections?
Why did the order change since the PR should address ONLY the case of ordering by year and something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order can change because previously we would have a strict order on the whole from_date (including day and month) but now simply on the YEAR, so (apart from the YEAR) can be affected by the order of the doc entries when inserted to the index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so how does the query dsl look like now? Previously I think it contained a "sort" : [{"from_date" : {"order" : "asc"}}] section. We don't have tests that check the generated ES query with a sort by a date field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, since a scalar is used, it needs to be translated into a scripted sorting, which currently is not supported for nested fields.

Chirstian |Koblick |1986
null |Chappelet |1988
Zvonko |Nyanchama |1989
Parto |Bamford |1995
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ordering here is completely wrong. 1995, 1986, 1996.... The query has ORDER BY start where start is YEAR(dep.from_date).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooops! I rushed here, sorry, need to check it. Thx for catching!

@matriv
Copy link
Contributor Author

matriv commented Jan 29, 2020

Seems that with the fix, YEAR() becomes always a script, and we cannot have scripted sorting on nested fields, It's a current restriction in ES.

My suggestion is to catch this in the Verifier and not allow to use scalars on nested fields.

@matriv matriv requested a review from astefan January 29, 2020 17:22
@matriv
Copy link
Contributor Author

matriv commented Jan 29, 2020

As discussed, added a detection to prevent usage of scalars on nested fields in ORDER BY.
I also opened an issue to enhance the error messages: #51636
as I tried to improve it but seems not so trivial.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

// used if the function is monotonic and thus does not have to be computed for ordering purposes
// null means the script needs to be used; expression means the field/expression to be used instead
// Every function needs to be translated to a script
// which will be used for the ordering
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the method can be removed entirely - is it used anywhere else?

List<FieldAttribute> nested = new ArrayList<>();
Consumer<FieldAttribute> match = fa -> {
Consumer<FieldAttribute> match = (fa) -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why (fa)?

if (fa.isNested()) {
nested.add(fa);
}
};

// nested fields shouldn't be used in aggregates or having (yet)
p.forEachDown(a -> a.groupings().forEach(agg -> agg.forEachUp(match, FieldAttribute.class)), Aggregate.class);
p.forEachDown(a -> a.groupings().forEach(agg -> agg.forEachUp(e ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit - this type of call seems to be used 3 times (last time twice) - maybe could be extracted into a small method to avoid repetition.
Maybe check if it's used anywhere else so even more of a reuse...

@costin
Copy link
Member

costin commented Jan 29, 2020

Please add also a section in the docs mentioning that scalar functions cannot be used in ordering on nested fields. Maybe even filtering...

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
But please, add a test to QueryTranslatorTests for ORDER BY YEAR() without GROUPing BY YEAR(). This PR proves that a small enough change (the overriden orderBy() method removal in Year) can change significantly the generated ES query.

@matriv matriv requested a review from astefan January 30, 2020 11:47
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left few comments.


[source, sql]
--------------------------------------------------
SELECT * FROM test_emp WHERE length(dep.dep_name.keyword) > 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capital letters for length?

if (fa.isNested()) {
nested.add(fa);
}
};
Consumer<Expression> checkForNested = e ->
attributeRefs.getOrDefault(e, e).forEachUp(matchNested, FieldAttribute.class);
Consumer<ScalarFunction> checkForNestedInFcuntion = f -> f.arguments().forEach(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in the variable name: checkForNestedInFcuntion .

Comment on lines -699 to -709
if (sf.orderBy() != null) {
Expression orderBy = sf.orderBy();
if (orderBy instanceof NamedExpression) {
orderBy = qContainer.aliases().getOrDefault(orderBy, orderBy);
qContainer = qContainer
.addSort(new AttributeSort(((NamedExpression) orderBy).toAttribute(), direction, missing));
} else if (orderBy.foldable() == false) {
// ignore constant
throw new PlanningException("does not know how to order by expression {}", orderBy);
}
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this logic is removed. We removed the faulty support for sorting on nested fields that needs scripting, but why is this block of code associated with nested fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was especially for Scalar Functions in case the orderBy() didn't return null -> no script creation, instead use the underlying field fo the ordering. The only use case of this was the YEAR() function.

@matriv matriv removed the v7.5.3 label Jan 30, 2020
@matriv matriv merged commit f41efd6 into elastic:master Jan 30, 2020
@matriv matriv deleted the fix-51224 branch January 30, 2020 13:48
matriv added a commit that referenced this pull request Jan 30, 2020
Previously, if YEAR() was used as and ORDER BY argument without being
wrapped with another scalar (e.g. YEAR(birth_date) + 10), no script
ordering was used but instead the underlying field (e.g. birth_date)
was used instead as a performance optimisation. This works correctly if
YEAR() is the only ORDER BY arg but if further args are used as tie
breakers for the ordering wrong results are produced. This is because
2 rows with the different birth_date but on the same year are not tied
as the underlying ordering is on birth_date and not on the
YEAR(birth_date), and the following ORDER BY args are ignored.

Remove this optimisation for YEAR() to avoid incorrect results in
such cases.

As a consequence another bug is revealed: scalar functions on top
of nested fields produce scripted sorting/filtering which is not yet
supported. In such cases no error was thrown but instead all values for
such nested fields were null and were passed to the script implementing
the sorting/filtering, producing incorrect results.

Detect such cases and throw a validation exception.

Fixes: #51224
(cherry picked from commit f41efd6)
matriv added a commit that referenced this pull request Jan 30, 2020
Previously, if YEAR() was used as and ORDER BY argument without being
wrapped with another scalar (e.g. YEAR(birth_date) + 10), no script
ordering was used but instead the underlying field (e.g. birth_date)
was used instead as a performance optimisation. This works correctly if
YEAR() is the only ORDER BY arg but if further args are used as tie
breakers for the ordering wrong results are produced. This is because
2 rows with the different birth_date but on the same year are not tied
as the underlying ordering is on birth_date and not on the
YEAR(birth_date), and the following ORDER BY args are ignored.

Remove this optimisation for YEAR() to avoid incorrect results in
such cases.

As a consequence another bug is revealed: scalar functions on top
of nested fields produce scripted sorting/filtering which is not yet
supported. In such cases no error was thrown but instead all values for
such nested fields were null and were passed to the script implementing
the sorting/filtering, producing incorrect results.

Detect such cases and throw a validation exception.

Fixes: #51224
(cherry picked from commit f41efd6)
matriv added a commit that referenced this pull request Jan 30, 2020
Previously, if YEAR() was used as and ORDER BY argument without being
wrapped with another scalar (e.g. YEAR(birth_date) + 10), no script
ordering was used but instead the underlying field (e.g. birth_date)
was used instead as a performance optimisation. This works correctly if
YEAR() is the only ORDER BY arg but if further args are used as tie
breakers for the ordering wrong results are produced. This is because
2 rows with the different birth_date but on the same year are not tied
as the underlying ordering is on birth_date and not on the
YEAR(birth_date), and the following ORDER BY args are ignored.

Remove this optimisation for YEAR() to avoid incorrect results in
such cases.

As a consequence another bug is revealed: scalar functions on top
of nested fields produce scripted sorting/filtering which is not yet
supported. In such cases no error was thrown but instead all values for
such nested fields were null and were passed to the script implementing
the sorting/filtering, producing incorrect results.

Detect such cases and throw a validation exception.

Fixes: #51224
(cherry picked from commit f41efd6)
@polyfractal polyfractal added v7.6.0 and removed v7.6.1 labels Feb 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL: ORDER BY field and function will order by field only
6 participants