Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't return TEXT type for functions that take TEXT #114334

Merged

Conversation

craigtaverner
Copy link
Contributor

@craigtaverner craigtaverner commented Oct 8, 2024

Always return KEYWORD for functions that previously returned TEXT, because any change to the value, no matter how small, is enough to render meaningless the original analyzer associated with the TEXT field value. In principle, if the attribute is no longer the original FieldAttribute, it can no longer claim to have the type TEXT.

This has been done for all functions: conversion functions, aggregating functions, multi-value functions. There were several that already produced KEYWORD for TEXT input (eg. ToString, FromBase64 and ToBase64, MvZip, ToLower, ToUpper, DateFormat, Concat, Left, Repeat, Replace, Right, Split, Substring), but many others that incorrectly claimed to produce TEXT, while this was really a false claim. This PR makes that now strict, and includes changes to the functions' units tests to disallow the tests to expect any functions output to be TEXT.

One side effect of this change is that methods that take multiple parameters that require all of them to have the same type, will now treat TEXT and KEYWORD the same. This was already the case for functions like Concat, but is now also the case for Greatest, Least, Case, Coalesce and MvAppend.

An associated change is that the type casting operator ::text has been entirely removed. It used to map onto the ToString function which returned type KEYWORD, and so ::text really produced a KEYWORD, which is a lie, or at least a bug, which is now fixed. Should we ever wish to actually produce real TEXT, we might love the fact that this operator has been freed up for future use (although it seems likely that function will require parameters to specify the analyzer, so might never be an operator again).

Backwards compatibility issues:

This is a change that will fail BWC tests, since we have many tests that assert on TEXT output to functions. For this reason we needed to block two scenarios:

  • We used the capability functions_never_emit_text to prevent 7 csv-spec tests and 2 yaml tests from being run against older versions that still emit text.
  • We used skipTest to also block those two yaml tests from being run against the latest build, but using older yaml files downloaded (as far back as 8.14).

In all cases the change observed in these tests was simply the results columns no longer having text type, and instead being keyword.

Kibana

Kibana makes use of some JSON files we produce during tests runs that explain the signatures of the functions, and the available type-casting operators. These are now edited to remove all TEXT output, and remove the ::test operator.

Fixes

Fixes #114333
Fixes #111537

Always return KEYWORD, because any change to the value, no matter how small, is enough to render meaningless the original analyzer associated with the TEXT value.
@craigtaverner craigtaverner added >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged v8.15.3 labels Oct 8, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @craigtaverner, I've created a changelog YAML for you.

@nik9000
Copy link
Member

nik9000 commented Oct 8, 2024

  • Change the parent class UnaryScalarFunction to always return KEYWORD for TEXT fields (done in this PR to cover functions not yet fixed)

I think I'd prefer if UnaryScalarFunction didn't have a default return type. That's a bigger change though because everyone's got to fill in their return type.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good step forward, but imo we could do better.
This needs a general evaluation of the situation because text as an output data type of our own functions is confusing, meaningless and not so useful at this point in the life of the language. So:

  • all functions need to be checked. I've caught max(text) which returns text. If we are to make this change, let's make it general.
  • to catch any misses, we should have a check in a central location for return data types of any kind of function/operator/anything-else (maybe after the logical planner is done, not sure). Maybe with an assert so that we don't fail queries at runtime, unless it's CI/tests.
  • another way of looking at the problem is that text (as a data type) does only make sense for FieldAttributes? (meaning we can accept text fields, we can do stuff with them, but creating a text shouldn't be possible)

@astefan
Copy link
Contributor

astefan commented Oct 9, 2024

  • Change the parent class UnaryScalarFunction to always return KEYWORD for TEXT fields (done in this PR to cover functions not yet fixed)

I think I'd prefer if UnaryScalarFunction didn't have a default return type. That's a bigger change though because everyone's got to fill in their return type.

++
Also, UnaryScalarFunction is not the only function in the class' hierarchy.

@nik9000
Copy link
Member

nik9000 commented Oct 9, 2024

central location

I was thinking that the tests for functions could assert that they never spit out text. I think you could override that when we go to implement ANALYZE(kwd, FRENCH) or whatever we call it, but only a hand full of function will ever output text fields and I'm ok with extra work telling you "you probably don't want to return text here" when writing them.

@@ -80,7 +82,8 @@ public Values replaceChildren(List<Expression> newChildren) {

@Override
public DataType dataType() {
return field().dataType();
DataType t = field().dataType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing this for every aggregation function, how about going to AggregateFunction and define there something similar to UnaryScalarFunction.dataType()?

We pull this text from 8.x and run it on current builds in yamlRestCompatTest, so we need to disable this check here (and in 8.x)
@@ -1435,7 +1435,7 @@ public static TestCase typeError(List<TypedData> data, String expectedTypeError)
this.source = Source.EMPTY;
this.data = data;
this.evaluatorToString = evaluatorToString;
this.expectedType = expectedType;
this.expectedType = expectedType == null ? null : expectedType.noText();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're already adding the .noText() to the test cases (like MaxTests), why are we adding it here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the comprehensive 'catch all' case we added late in the PR once we had covered all the low hanging fruit. But perhaps this means we can remove some of the other earlier code. I'll investigate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed there were 12 places we had added this noText() that no longer needed it, since they all called into this TestCase constructor. I've removed them all and will push a commit after re-running tests to be safe.

@@ -117,7 +117,6 @@ public class EsqlDataTypeConverter {
entry(LONG, ToLong::new),
// ToRadians, typeless
entry(KEYWORD, ToString::new),
entry(TEXT, ToString::new),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change may close/cancel #111537 (?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I was busy creating a new issue for that, but glad to see you already did, way back! I'll add a 'fixes' line to the PR description.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Looking much better. Thanks, Craig.

Copy link
Contributor

@luigidellaquila luigidellaquila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@elasticsearchmachine
Copy link
Collaborator

Hi @craigtaverner, I've updated the changelog YAML for you.

Since we added a check for this inside the TestCase constructor, all these extra calls are no longer needed.
…verner/elasticsearch into esql_functions_never_return_text
Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@@ -584,6 +584,10 @@ static Builder builder() {
return new Builder();
}

public DataType noText() {
return this == DataType.TEXT ? DataType.KEYWORD : this;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
return this == DataType.TEXT ? DataType.KEYWORD : this;
return this == TEXT ? KEYWORD : this;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 112 to 113
DataType t = field().dataType();
return t == TEXT ? KEYWORD : t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DataType t = field().dataType();
return t == TEXT ? KEYWORD : t;
return field().dataType().noText();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I normally like to accept the commit suggestion, but in this case the remaining imports would cause a build failure, so I did this locally and will push a commit for all three places soon.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all good, it's not like i really contributed :)

@@ -584,6 +584,10 @@ static Builder builder() {
return new Builder();
}

public DataType noText() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

textAsKeyword()? noText is compact, but I find it a tad misleading. Minor preference, can stay as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is sound, but I have a minor preference for noText as textAsKeyword also seems to say something about the current datatype being TEXT; which it likely is not.

Comment on lines 180 to 181
DataType t = field().dataType();
return t == TEXT ? KEYWORD : t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noText()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -91,7 +93,8 @@ public Values withFilter(Expression filter) {

@Override
public DataType dataType() {
return field().dataType();
DataType t = field().dataType();
return t == TEXT ? KEYWORD : t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noText()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@craigtaverner craigtaverner merged commit 3d307e0 into elastic:main Oct 25, 2024
16 checks passed
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

craigtaverner added a commit to craigtaverner/elasticsearch that referenced this pull request Oct 25, 2024
Always return `KEYWORD` for functions that previously returned `TEXT`, because any change to the value, no matter how small, is enough to render meaningless the original analyzer associated with the `TEXT` field value. In principle, if the attribute is no longer the original `FieldAttribute`, it can no longer claim to have the type `TEXT`.

This has been done for all functions: conversion functions, aggregating functions, multi-value functions. There were several that already produced `KEYWORD` for `TEXT` input (eg. ToString, FromBase64 and ToBase64, MvZip, ToLower, ToUpper, DateFormat, Concat, Left, Repeat, Replace, Right, Split, Substring), but many others that incorrectly claimed to produce `TEXT`, while this was really a false claim. This PR makes that now strict, and includes changes to the functions' units tests to disallow the tests to expect any functions output to be `TEXT`.

One side effect of this change is that methods that take multiple parameters that require all of them to have the same type, will now treat TEXT and KEYWORD the same. This was already the case for functions like `Concat`, but is now also the case for `Greatest`, `Least`, `Case`, `Coalesce` and `MvAppend`.

An associated change is that the type casting operator `::text` has been entirely removed. It used to map onto the `ToString` function which returned type KEYWORD, and so `::text` really produced a `KEYWORD`, which is a lie, or at least a `bug`, which is now fixed. Should we ever wish to actually produce real `TEXT`, we might love the fact that this operator has been freed up for future use (although it seems likely that function will require parameters to specify the analyzer, so might never be an operator again).

### Backwards compatibility issues:

This is a change that will fail BWC tests, since we have many tests that assert on TEXT output to functions. For this reason we needed to block two scenarios:

* We used the capability `functions_never_emit_text` to prevent 7 csv-spec tests and 2 yaml tests from being run against older versions that still emit text.
* We used `skipTest` to also block those two yaml tests from being run against the latest build, but using older yaml files downloaded (as far back as 8.14).

In all cases the change observed in these tests was simply the results columns no longer having `text` type, and instead being `keyword`.

---------

Co-authored-by: Luigi Dell'Aquila <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Oct 25, 2024
Always return `KEYWORD` for functions that previously returned `TEXT`, because any change to the value, no matter how small, is enough to render meaningless the original analyzer associated with the `TEXT` field value. In principle, if the attribute is no longer the original `FieldAttribute`, it can no longer claim to have the type `TEXT`.

This has been done for all functions: conversion functions, aggregating functions, multi-value functions. There were several that already produced `KEYWORD` for `TEXT` input (eg. ToString, FromBase64 and ToBase64, MvZip, ToLower, ToUpper, DateFormat, Concat, Left, Repeat, Replace, Right, Split, Substring), but many others that incorrectly claimed to produce `TEXT`, while this was really a false claim. This PR makes that now strict, and includes changes to the functions' units tests to disallow the tests to expect any functions output to be `TEXT`.

One side effect of this change is that methods that take multiple parameters that require all of them to have the same type, will now treat TEXT and KEYWORD the same. This was already the case for functions like `Concat`, but is now also the case for `Greatest`, `Least`, `Case`, `Coalesce` and `MvAppend`.

An associated change is that the type casting operator `::text` has been entirely removed. It used to map onto the `ToString` function which returned type KEYWORD, and so `::text` really produced a `KEYWORD`, which is a lie, or at least a `bug`, which is now fixed. Should we ever wish to actually produce real `TEXT`, we might love the fact that this operator has been freed up for future use (although it seems likely that function will require parameters to specify the analyzer, so might never be an operator again).

### Backwards compatibility issues:

This is a change that will fail BWC tests, since we have many tests that assert on TEXT output to functions. For this reason we needed to block two scenarios:

* We used the capability `functions_never_emit_text` to prevent 7 csv-spec tests and 2 yaml tests from being run against older versions that still emit text.
* We used `skipTest` to also block those two yaml tests from being run against the latest build, but using older yaml files downloaded (as far back as 8.14).

In all cases the change observed in these tests was simply the results columns no longer having `text` type, and instead being `keyword`.

---------

Co-authored-by: Luigi Dell'Aquila <[email protected]>
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
Always return `KEYWORD` for functions that previously returned `TEXT`, because any change to the value, no matter how small, is enough to render meaningless the original analyzer associated with the `TEXT` field value. In principle, if the attribute is no longer the original `FieldAttribute`, it can no longer claim to have the type `TEXT`.

This has been done for all functions: conversion functions, aggregating functions, multi-value functions. There were several that already produced `KEYWORD` for `TEXT` input (eg. ToString, FromBase64 and ToBase64, MvZip, ToLower, ToUpper, DateFormat, Concat, Left, Repeat, Replace, Right, Split, Substring), but many others that incorrectly claimed to produce `TEXT`, while this was really a false claim. This PR makes that now strict, and includes changes to the functions' units tests to disallow the tests to expect any functions output to be `TEXT`.

One side effect of this change is that methods that take multiple parameters that require all of them to have the same type, will now treat TEXT and KEYWORD the same. This was already the case for functions like `Concat`, but is now also the case for `Greatest`, `Least`, `Case`, `Coalesce` and `MvAppend`.

An associated change is that the type casting operator `::text` has been entirely removed. It used to map onto the `ToString` function which returned type KEYWORD, and so `::text` really produced a `KEYWORD`, which is a lie, or at least a `bug`, which is now fixed. Should we ever wish to actually produce real `TEXT`, we might love the fact that this operator has been freed up for future use (although it seems likely that function will require parameters to specify the analyzer, so might never be an operator again).

### Backwards compatibility issues:

This is a change that will fail BWC tests, since we have many tests that assert on TEXT output to functions. For this reason we needed to block two scenarios:

* We used the capability `functions_never_emit_text` to prevent 7 csv-spec tests and 2 yaml tests from being run against older versions that still emit text.
* We used `skipTest` to also block those two yaml tests from being run against the latest build, but using older yaml files downloaded (as far back as 8.14).

In all cases the change observed in these tests was simply the results columns no longer having `text` type, and instead being `keyword`.

---------

Co-authored-by: Luigi Dell'Aquila <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Dec 23, 2024
TO_UPPER/TO_LOWER resolution incorrectly returned child's type (that
could also be `null`, type `NULL`), instead of KEYWORD/TEXT. So a test
like `TO_UPPER(null) == "..."` fails on type mismatch. This was fixed
collaterally by #114334 (8.17.0)

Also, correct some of the tests skipping (that had however no impact,
due to testing range).
bpintea added a commit to bpintea/elasticsearch that referenced this pull request Dec 23, 2024
TO_UPPER/TO_LOWER resolution incorrectly returned child's type (that
could also be `null`, type `NULL`), instead of KEYWORD/TEXT. So a test
like `TO_UPPER(null) == "..."` fails on type mismatch. This was fixed
collaterally by elastic#114334 (8.17.0)

Also, correct some of the tests skipping (that had however no impact,
due to testing range).

(cherry picked from commit edb3818)
elasticsearchmachine pushed a commit that referenced this pull request Dec 23, 2024
* ESQL: Rewrite TO_UPPER/TO_LOWER comparisons (#118870)

This adds an optimization rule to rewrite TO_UPPER/TO_LOWER comparisons
against a string into an InsensitiveEquals comparison. The rewrite can
also result right away into a TRUE/FALSE, in case the string doesn't
match the caseness of the function.

This also allows later pushing down the predicate to lucene as a
case-insensitive term-query.

Fixes #118304.

* Disable `TO_UPPER(null)`-tests prior to 8.17 (#119213)

TO_UPPER/TO_LOWER resolution incorrectly returned child's type (that
could also be `null`, type `NULL`), instead of KEYWORD/TEXT. So a test
like `TO_UPPER(null) == "..."` fails on type mismatch. This was fixed
collaterally by #114334 (8.17.0)

Also, correct some of the tests skipping (that had however no impact,
due to testing range).

(cherry picked from commit edb3818)
navarone-feekery pushed a commit to navarone-feekery/elasticsearch that referenced this pull request Dec 26, 2024
TO_UPPER/TO_LOWER resolution incorrectly returned child's type (that
could also be `null`, type `NULL`), instead of KEYWORD/TEXT. So a test
like `TO_UPPER(null) == "..."` fails on type mismatch. This was fixed
collaterally by elastic#114334 (8.17.0)

Also, correct some of the tests skipping (that had however no impact,
due to testing range).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >bug ES|QL-ui Impacts ES|QL UI Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.17.0 v9.0.0
Projects
None yet