Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ES|QL] String literal implicit casting #106932

Merged
merged 24 commits into from
Apr 11, 2024

Conversation

fang-xing-esql
Copy link
Member

@fang-xing-esql fang-xing-esql commented Mar 29, 2024

Cast string literals to expected data types of ESQL scalar functions and arithmetic operations automatically.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a small round of comments. This is shaping up nicely. Please check its impact on the rest of the code and look at removing to string conversion from various functions and see if casting works.

@fang-xing-esql fang-xing-esql marked this pull request as ready for review April 2, 2024 15:17
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 2, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@fang-xing-esql fang-xing-esql requested a review from nik9000 April 2, 2024 15:17
@fang-xing-esql
Copy link
Member Author

Left a small round of comments. This is shaping up nicely. Please check its impact on the rest of the code and look at removing to string conversion from various functions and see if casting works.

Thanks for reviewing! One place that we do string conversation in an EsqlScalarFunction is AutoBucket, the 3rd and 4th inputs are converted from string to datetime inside the function, only if its 1st input is a datetime. If its 1st input is numeric, we don't convert. Since the conversion depends on another argument, we might just leave it inside the AutoBucket function.

@costin costin requested a review from astefan April 5, 2024 17:18
Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main comment is around still using string/keyword for AutoBucket - otherwise LGTM.

@costin costin added the ES|QL-ui Impacts ES|QL UI label Apr 5, 2024
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments but, most importantly, this PR changes at least one current behavior into a not-ideal manner.
The following query from employees | EVAL k = hire_date > DATE_FORMAT(\"YYYY-MM-dd\", \"1989-06-02T00:00:00.000Z\") | keep languages, hire_date (I didn't spend time in refining it) generates a user friendly error message in main:

                "type": "verification_exception",
                "reason": "Found 1 problem\nline 1:39: second argument of [DATE_FORMAT(\"YYYY-MM-dd\", \"1989-06-02T00:00:00.000Z\")] must be [datetime], found value [\"1989-06-02T00:00:00.000Z\"] type [keyword]",
                "stack_trace": "org.elasticsearch.xpack.esql.VerificationException: Found 1 problem\nline 1:39: second argument of [DATE_FORMAT(\"YYYY-MM-dd\", \"1989-06-02T00:00:00.000Z\")] must be [datetime], found value [\"1989-06-02T00:00:00.000Z\"] type [keyword]\r\n\tat org.elasticsearch.xpack.esql.analysis.Analyzer.verify(Analyzer.java:133)

while with this PR the error is not pretty

path: /_query, params: {format=txt, error_trace=true}, status: 500 java.lang.ClassCastException: class java.lang.String cannot be cast to class org.apache.lucene.util.BytesRef (java.lang.String is in module java.base of loader 'bootstrap'; org.apache.lucene.util.BytesRef is in module [email protected] of loader 'app')
        at org.elasticsearch.xpack.esql.expression.function.scalar.date.DateFormat.toFormatter(DateFormat.java:114)
        at org.elasticsearch.xpack.esql.expression.function.scalar.date.DateFormat.toEvaluator(DateFormat.java:106)
        at org.elasticsearch.xpack.esql.evaluator.mapper.EvaluatorMapper.fold(EvaluatorMapper.java:48)
        at org.elasticsearch.xpack.esql.expression.function.scalar.EsqlScalarFunction.fold(EsqlScalarFunction.java:30)
        at org.elasticsearch.xpack.esql.analysis.Analyzer$PromoteStringsInDateComparisons.stringToDate(Analyzer.java:817)
        at org.elasticsearch.xpack.esql.analysis.Analyzer$PromoteStringsInDateComparisons.promote(Analyzer.java:802)
        at org.elasticsearch.xpack.ql.tree.Node.lambda$transformUp$11(Node.java:198)
        at org.elasticsearch.xpack.ql.tree.Node.transformUp(Node.java:192)

@fang-xing-esql
Copy link
Member Author

Left some comments but, most importantly, this PR changes at least one current behavior into a not-ideal manner. The following query from employees | EVAL k = hire_date > DATE_FORMAT(\"YYYY-MM-dd\", \"1989-06-02T00:00:00.000Z\") | keep languages, hire_date (I didn't spend time in refining it) generates a user friendly error message in main:

                "type": "verification_exception",
                "reason": "Found 1 problem\nline 1:39: second argument of [DATE_FORMAT(\"YYYY-MM-dd\", \"1989-06-02T00:00:00.000Z\")] must be [datetime], found value [\"1989-06-02T00:00:00.000Z\"] type [keyword]",
                "stack_trace": "org.elasticsearch.xpack.esql.VerificationException: Found 1 problem\nline 1:39: second argument of [DATE_FORMAT(\"YYYY-MM-dd\", \"1989-06-02T00:00:00.000Z\")] must be [datetime], found value [\"1989-06-02T00:00:00.000Z\"] type [keyword]\r\n\tat org.elasticsearch.xpack.esql.analysis.Analyzer.verify(Analyzer.java:133)

while with this PR the error is not pretty

                "type": "verification_exception",
                "reason": "Found 1 problem\nline 1:39: second argument of [DATE_FORMAT(\"YYYY-MM-dd\", \"1989-06-02T00:00:00.000Z\")] must be [datetime], found value [\"1989-06-02T00:00:00.000Z\"] type [keyword]",
                "stack_trace": "org.elasticsearch.xpack.esql.VerificationException: Found 1 problem\nline 1:39: second argument of [DATE_FORMAT(\"YYYY-MM-dd\", \"1989-06-02T00:00:00.000Z\")] must be [datetime], found value [\"1989-06-02T00:00:00.000Z\"] type [keyword]\r\n\tat org.elasticsearch.xpack.esql.analysis.Analyzer.verify(Analyzer.java:133)

Thank you for catching this issue!

After casting the second inputs to date_format to date with the ImplicitCasting rule, type resolution passed for date_format, and it is foldable now, however a new error is thrown from the PromoteStringsInDateComparisons rule. The left hand side hire_date is a date type, the output from date_format is a string type, this is why it errors out.

Currently the auto-casting in ImplicitCasting and PromoteStringsInDateComparisons support string literal only, it does not cast the string output of scalar functions. The scalar functions will be evaluated/fold during logical plan optimizer, so it is too early to cast the output of date_format to date type during resolution phase.

The query below errors out on main with the same error at the problem query above on the branch:
from employees | EVAL k = hire_date > DATE_FORMAT("YYYY-MM-dd", now()) | keep languages, hire_date;

java.lang.ClassCastException: class java.lang.String cannot be cast to class org.apache.lucene.util.BytesRef (java.lang.String is in module java.base of loader 'bootstrap'; org.apache.lucene.util.BytesRef is in unnamed module of loader 'app')

I could add a check in PromoteStringsInDateComparisons to cast only literals, and skip functions. Perhaps I should add the same check in ImplicitCasting for nested functions.

@astefan
Copy link
Contributor

astefan commented Apr 8, 2024

@fang-xing-esql Now I realized I didn't copy-pasted the right exception for what happens in this PR. Sorry about that and thank you for testing and confirming in your reply the behavior. I've updated my comment above.


return types.stream()
.min((dt1, dt2) -> dataTypeCastingPriority.get(dt1).compareTo(dataTypeCastingPriority.get(dt2)))
.orElse(DataTypes.NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably shouldn't be NULL - if you leave off orElse then you'll get an Optional type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try if UNSUPPORTED makes sense.

@fang-xing-esql fang-xing-esql merged commit 0075c1f into elastic:main Apr 11, 2024
14 checks passed
@@ -34,246 +34,6 @@
"variadic" : false,
"returnType" : "datetime"
},
{
"params" : [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drewdaemon I've just realized you'll need this too. Here's the deal - we have automatic promotion of string literals. For this particular function these always had to be literals anyway so we no longer think of these as valid signatures.

This is an interesting wrinkle. We don't send anything telling you this has to be a literal. We probably should.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nik9000 thanks for thinking of us! Yes, we currently have this information hard-coded on our end. If there's a way to surface this in the definitions y'all provide us... all the better!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could generate something like:

       {
          "name" : "from",
          "type" : "keyword",
          "optional" : false,
          "description" : "",
          "only_literals": "date"
        },

Or something like that. But in general if something says datetime now it can take a literal string that is formatted like a date. Same for numbers. But that's quite a lot to add to your editor, especially right now.

There are other cases where we send you "only_literals": "any" for bits that take any literal. The buckets parameter here only takes literal integers. No fields allowed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this kind of thing would be great. I have put elastic/kibana#179634 a bit on hold while I squish editor bugs for the 8.14 release. Once that settles down, I'm going to work on pulling these generated definitions into Kibana "for real."

At that point, I think I'll have a clearer understanding of exactly what structure would work best.

But in general if something says datetime now it can take a literal string that is formatted like a date. Same for numbers.

Noted, thanks!

datetime | integer | keyword | text | datetime
datetime | integer | text | datetime | datetime
datetime | integer | text | keyword | datetime
datetime | integer | text | text | datetime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting. We no longer tell folks they can use strings here, only dates. But maybe a human is ok because the examples will contain string literals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct, though? The function does accept the string types (the foldToLong() logic remained), but generally, wouldn't we have to rather extend the function infos of scalar/arithmetic functions to include these types, now that we autocast?

Copy link
Member Author

@fang-xing-esql fang-xing-esql Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the implicit casting is done at resolution phase. We can do auto casting if the string literals are provided as input to the bucket function directly, e.g.
BUCKET(date, 20, "1985-01-01T00:00:00Z", "1986-01-01T00:00:00Z"). --> auto casting convert from and to to datetime

However, if the string literals are provided as input to the bucket function through a variable, auto casting does not apply, e.g.
| EVAL bucket_start = "1999-01-01T00:00:00Z". --> auto casting does not apply
| EVAL bucket_end = NOW()
| EVAL bucket = AUTO_BUCKET(hire_date, 5, bucket_start, bucket_end)

This is the main reason that there are still code inside to function to do the conversion, as the foldable eval propagation happens at logical planner. We have some options to make auto casting work better, like:

  • apply the implicit casting rule in logical planner after foldable eval propagation
  • apply foldable eval propagation in resolution phase (I brought up this idea before, probably not a good one)

There are some follow up work related to the auto casting. A lot(perhaps most) of the rules in analyzer and logical planner are related to query transformation, there are chances that we can rearrange the rules to make resolution and query transformation work more effectively.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As to whether this is right or not, I dunno. You certainly can only put date-like-strings in it. And now that we support implicit conversion for these literals everywhere where this list says "date" it means "date or date-like-string". I dunno.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main reason that there are still code inside to function to do the conversion

Sure, that's fine. I was mostly commenting on adjusting the function info to remove the keyword and text types as accepted by the function (and the docs changes that ensued).

craigtaverner pushed a commit to craigtaverner/elasticsearch that referenced this pull request Apr 11, 2024
* string literal casting for scalar functions and arithmetic operations.
drewdaemon added a commit to elastic/kibana that referenced this pull request May 8, 2024
## Summary

## String constants in place of dates

As of elastic/elasticsearch#106932 string
constants can always be used in place of dates.

- `| eval date_extract("DAY_OF_WEEK", "2024-05-07T13:20:40.554Z")`
- `| STATS AVG(salary) BY bucket = BUCKET(hire_date, 20,
"1985-01-01T00:00:00Z", "1986-01-01T00:00:00Z")
| SORT bucket`
- `| eval date_diff("day", "2021-01-01", "2022-01-01")`

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <[email protected]>
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request May 8, 2024
## Summary

## String constants in place of dates

As of elastic/elasticsearch#106932 string
constants can always be used in place of dates.

- `| eval date_extract("DAY_OF_WEEK", "2024-05-07T13:20:40.554Z")`
- `| STATS AVG(salary) BY bucket = BUCKET(hire_date, 20,
"1985-01-01T00:00:00Z", "1986-01-01T00:00:00Z")
| SORT bucket`
- `| eval date_diff("day", "2021-01-01", "2022-01-01")`

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <[email protected]>
(cherry picked from commit 9ddc3c6)
kibanamachine added a commit to elastic/kibana that referenced this pull request May 8, 2024
# Backport

This will backport the following commits from `main` to `8.14`:
- [[ES|QL] accept string constants for date args
(#182856)](#182856)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Drew
Tate","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-05-08T13:53:47Z","message":"[ES|QL]
accept string constants for date args (#182856)\n\n## Summary\r\n\r\n##
String constants in place of dates\r\n\r\nAs of
elastic/elasticsearch#106932 string\r\nconstants
can always be used in place of dates.\r\n\r\n- `| eval
date_extract(\"DAY_OF_WEEK\", \"2024-05-07T13:20:40.554Z\")`\r\n- `|
STATS AVG(salary) BY bucket = BUCKET(hire_date,
20,\r\n\"1985-01-01T00:00:00Z\", \"1986-01-01T00:00:00Z\")\r\n| SORT
bucket`\r\n- `| eval date_diff(\"day\", \"2021-01-01\",
\"2022-01-01\")`\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<[email protected]>","sha":"9ddc3c6aa836e9e5d3bdd30512f7de55d62a16d2","branchLabelMapping":{"^v8.15.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","backport:prev-minor","Feature:ES|QL","v8.14.0","Team:ESQL","v8.15.0"],"title":"[ES|QL]
accept string constants for date
args","number":182856,"url":"https://github.com/elastic/kibana/pull/182856","mergeCommit":{"message":"[ES|QL]
accept string constants for date args (#182856)\n\n## Summary\r\n\r\n##
String constants in place of dates\r\n\r\nAs of
elastic/elasticsearch#106932 string\r\nconstants
can always be used in place of dates.\r\n\r\n- `| eval
date_extract(\"DAY_OF_WEEK\", \"2024-05-07T13:20:40.554Z\")`\r\n- `|
STATS AVG(salary) BY bucket = BUCKET(hire_date,
20,\r\n\"1985-01-01T00:00:00Z\", \"1986-01-01T00:00:00Z\")\r\n| SORT
bucket`\r\n- `| eval date_diff(\"day\", \"2021-01-01\",
\"2022-01-01\")`\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<[email protected]>","sha":"9ddc3c6aa836e9e5d3bdd30512f7de55d62a16d2"}},"sourceBranch":"main","suggestedTargetBranches":["8.14"],"targetPullRequestStates":[{"branch":"8.14","label":"v8.14.0","branchLabelMappingKey":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v8.15.0","branchLabelMappingKey":"^v8.15.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/182856","number":182856,"mergeCommit":{"message":"[ES|QL]
accept string constants for date args (#182856)\n\n## Summary\r\n\r\n##
String constants in place of dates\r\n\r\nAs of
elastic/elasticsearch#106932 string\r\nconstants
can always be used in place of dates.\r\n\r\n- `| eval
date_extract(\"DAY_OF_WEEK\", \"2024-05-07T13:20:40.554Z\")`\r\n- `|
STATS AVG(salary) BY bucket = BUCKET(hire_date,
20,\r\n\"1985-01-01T00:00:00Z\", \"1986-01-01T00:00:00Z\")\r\n| SORT
bucket`\r\n- `| eval date_diff(\"day\", \"2021-01-01\",
\"2022-01-01\")`\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<[email protected]>","sha":"9ddc3c6aa836e9e5d3bdd30512f7de55d62a16d2"}}]}]
BACKPORT-->

Co-authored-by: Drew Tate <[email protected]>
@fang-xing-esql fang-xing-esql deleted the auto-casting-functions branch May 9, 2024 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >enhancement ES|QL-ui Impacts ES|QL UI Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.14.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants