-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't return TEXT type for functions that take TEXT #114334
Don't return TEXT type for functions that take TEXT #114334
Conversation
Always return KEYWORD, because any change to the value, no matter how small, is enough to render meaningless the original analyzer associated with the TEXT value.
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Hi @craigtaverner, I've created a changelog YAML for you. |
…verner/elasticsearch into esql_functions_never_return_text
I think I'd prefer if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a good step forward, but imo we could do better.
This needs a general evaluation of the situation because text
as an output data type of our own functions is confusing, meaningless and not so useful at this point in the life of the language. So:
- all functions need to be checked. I've caught
max(text)
which returnstext
. If we are to make this change, let's make it general. - to catch any misses, we should have a check in a central location for return data types of any kind of function/operator/anything-else (maybe after the logical planner is done, not sure). Maybe with an
assert
so that we don't fail queries at runtime, unless it's CI/tests. - another way of looking at the problem is that
text
(as a data type) does only make sense for FieldAttributes? (meaning we can accepttext
fields, we can do stuff with them, but creating atext
shouldn't be possible)
++ |
I was thinking that the tests for functions could assert that they never spit out |
@@ -80,7 +82,8 @@ public Values replaceChildren(List<Expression> newChildren) { | |||
|
|||
@Override | |||
public DataType dataType() { | |||
return field().dataType(); | |||
DataType t = field().dataType(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of doing this for every aggregation function, how about going to AggregateFunction
and define there something similar to UnaryScalarFunction.dataType()
?
We pull this text from 8.x and run it on current builds in yamlRestCompatTest, so we need to disable this check here (and in 8.x)
@@ -1435,7 +1435,7 @@ public static TestCase typeError(List<TypedData> data, String expectedTypeError) | |||
this.source = Source.EMPTY; | |||
this.data = data; | |||
this.evaluatorToString = evaluatorToString; | |||
this.expectedType = expectedType; | |||
this.expectedType = expectedType == null ? null : expectedType.noText(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're already adding the .noText() to the test cases (like MaxTests), why are we adding it here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the comprehensive 'catch all' case we added late in the PR once we had covered all the low hanging fruit. But perhaps this means we can remove some of the other earlier code. I'll investigate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed there were 12 places we had added this noText()
that no longer needed it, since they all called into this TestCase constructor. I've removed them all and will push a commit after re-running tests to be safe.
@@ -117,7 +117,6 @@ public class EsqlDataTypeConverter { | |||
entry(LONG, ToLong::new), | |||
// ToRadians, typeless | |||
entry(KEYWORD, ToString::new), | |||
entry(TEXT, ToString::new), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change may close/cancel #111537 (?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! I was busy creating a new issue for that, but glad to see you already did, way back! I'll add a 'fixes' line to the PR description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Looking much better. Thanks, Craig.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Hi @craigtaverner, I've updated the changelog YAML for you. |
Since we added a check for this inside the TestCase constructor, all these extra calls are no longer needed.
…verner/elasticsearch into esql_functions_never_return_text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm
@@ -584,6 +584,10 @@ static Builder builder() { | |||
return new Builder(); | |||
} | |||
|
|||
public DataType noText() { | |||
return this == DataType.TEXT ? DataType.KEYWORD : this; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
return this == DataType.TEXT ? DataType.KEYWORD : this; | |
return this == TEXT ? KEYWORD : this; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
DataType t = field().dataType(); | ||
return t == TEXT ? KEYWORD : t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataType t = field().dataType(); | |
return t == TEXT ? KEYWORD : t; | |
return field().dataType().noText(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I normally like to accept the commit suggestion, but in this case the remaining imports would cause a build failure, so I did this locally and will push a commit for all three places soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all good, it's not like i really contributed :)
@@ -584,6 +584,10 @@ static Builder builder() { | |||
return new Builder(); | |||
} | |||
|
|||
public DataType noText() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
textAsKeyword()
? noText
is compact, but I find it a tad misleading. Minor preference, can stay as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is sound, but I have a minor preference for noText
as textAsKeyword
also seems to say something about the current datatype being TEXT; which it likely is not.
DataType t = field().dataType(); | ||
return t == TEXT ? KEYWORD : t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noText()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -91,7 +93,8 @@ public Values withFilter(Expression filter) { | |||
|
|||
@Override | |||
public DataType dataType() { | |||
return field().dataType(); | |||
DataType t = field().dataType(); | |||
return t == TEXT ? KEYWORD : t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noText()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
💚 Backport successful
|
Always return `KEYWORD` for functions that previously returned `TEXT`, because any change to the value, no matter how small, is enough to render meaningless the original analyzer associated with the `TEXT` field value. In principle, if the attribute is no longer the original `FieldAttribute`, it can no longer claim to have the type `TEXT`. This has been done for all functions: conversion functions, aggregating functions, multi-value functions. There were several that already produced `KEYWORD` for `TEXT` input (eg. ToString, FromBase64 and ToBase64, MvZip, ToLower, ToUpper, DateFormat, Concat, Left, Repeat, Replace, Right, Split, Substring), but many others that incorrectly claimed to produce `TEXT`, while this was really a false claim. This PR makes that now strict, and includes changes to the functions' units tests to disallow the tests to expect any functions output to be `TEXT`. One side effect of this change is that methods that take multiple parameters that require all of them to have the same type, will now treat TEXT and KEYWORD the same. This was already the case for functions like `Concat`, but is now also the case for `Greatest`, `Least`, `Case`, `Coalesce` and `MvAppend`. An associated change is that the type casting operator `::text` has been entirely removed. It used to map onto the `ToString` function which returned type KEYWORD, and so `::text` really produced a `KEYWORD`, which is a lie, or at least a `bug`, which is now fixed. Should we ever wish to actually produce real `TEXT`, we might love the fact that this operator has been freed up for future use (although it seems likely that function will require parameters to specify the analyzer, so might never be an operator again). ### Backwards compatibility issues: This is a change that will fail BWC tests, since we have many tests that assert on TEXT output to functions. For this reason we needed to block two scenarios: * We used the capability `functions_never_emit_text` to prevent 7 csv-spec tests and 2 yaml tests from being run against older versions that still emit text. * We used `skipTest` to also block those two yaml tests from being run against the latest build, but using older yaml files downloaded (as far back as 8.14). In all cases the change observed in these tests was simply the results columns no longer having `text` type, and instead being `keyword`. --------- Co-authored-by: Luigi Dell'Aquila <[email protected]>
Always return `KEYWORD` for functions that previously returned `TEXT`, because any change to the value, no matter how small, is enough to render meaningless the original analyzer associated with the `TEXT` field value. In principle, if the attribute is no longer the original `FieldAttribute`, it can no longer claim to have the type `TEXT`. This has been done for all functions: conversion functions, aggregating functions, multi-value functions. There were several that already produced `KEYWORD` for `TEXT` input (eg. ToString, FromBase64 and ToBase64, MvZip, ToLower, ToUpper, DateFormat, Concat, Left, Repeat, Replace, Right, Split, Substring), but many others that incorrectly claimed to produce `TEXT`, while this was really a false claim. This PR makes that now strict, and includes changes to the functions' units tests to disallow the tests to expect any functions output to be `TEXT`. One side effect of this change is that methods that take multiple parameters that require all of them to have the same type, will now treat TEXT and KEYWORD the same. This was already the case for functions like `Concat`, but is now also the case for `Greatest`, `Least`, `Case`, `Coalesce` and `MvAppend`. An associated change is that the type casting operator `::text` has been entirely removed. It used to map onto the `ToString` function which returned type KEYWORD, and so `::text` really produced a `KEYWORD`, which is a lie, or at least a `bug`, which is now fixed. Should we ever wish to actually produce real `TEXT`, we might love the fact that this operator has been freed up for future use (although it seems likely that function will require parameters to specify the analyzer, so might never be an operator again). ### Backwards compatibility issues: This is a change that will fail BWC tests, since we have many tests that assert on TEXT output to functions. For this reason we needed to block two scenarios: * We used the capability `functions_never_emit_text` to prevent 7 csv-spec tests and 2 yaml tests from being run against older versions that still emit text. * We used `skipTest` to also block those two yaml tests from being run against the latest build, but using older yaml files downloaded (as far back as 8.14). In all cases the change observed in these tests was simply the results columns no longer having `text` type, and instead being `keyword`. --------- Co-authored-by: Luigi Dell'Aquila <[email protected]>
Always return `KEYWORD` for functions that previously returned `TEXT`, because any change to the value, no matter how small, is enough to render meaningless the original analyzer associated with the `TEXT` field value. In principle, if the attribute is no longer the original `FieldAttribute`, it can no longer claim to have the type `TEXT`. This has been done for all functions: conversion functions, aggregating functions, multi-value functions. There were several that already produced `KEYWORD` for `TEXT` input (eg. ToString, FromBase64 and ToBase64, MvZip, ToLower, ToUpper, DateFormat, Concat, Left, Repeat, Replace, Right, Split, Substring), but many others that incorrectly claimed to produce `TEXT`, while this was really a false claim. This PR makes that now strict, and includes changes to the functions' units tests to disallow the tests to expect any functions output to be `TEXT`. One side effect of this change is that methods that take multiple parameters that require all of them to have the same type, will now treat TEXT and KEYWORD the same. This was already the case for functions like `Concat`, but is now also the case for `Greatest`, `Least`, `Case`, `Coalesce` and `MvAppend`. An associated change is that the type casting operator `::text` has been entirely removed. It used to map onto the `ToString` function which returned type KEYWORD, and so `::text` really produced a `KEYWORD`, which is a lie, or at least a `bug`, which is now fixed. Should we ever wish to actually produce real `TEXT`, we might love the fact that this operator has been freed up for future use (although it seems likely that function will require parameters to specify the analyzer, so might never be an operator again). ### Backwards compatibility issues: This is a change that will fail BWC tests, since we have many tests that assert on TEXT output to functions. For this reason we needed to block two scenarios: * We used the capability `functions_never_emit_text` to prevent 7 csv-spec tests and 2 yaml tests from being run against older versions that still emit text. * We used `skipTest` to also block those two yaml tests from being run against the latest build, but using older yaml files downloaded (as far back as 8.14). In all cases the change observed in these tests was simply the results columns no longer having `text` type, and instead being `keyword`. --------- Co-authored-by: Luigi Dell'Aquila <[email protected]>
TO_UPPER/TO_LOWER resolution incorrectly returned child's type (that could also be `null`, type `NULL`), instead of KEYWORD/TEXT. So a test like `TO_UPPER(null) == "..."` fails on type mismatch. This was fixed collaterally by #114334 (8.17.0) Also, correct some of the tests skipping (that had however no impact, due to testing range).
TO_UPPER/TO_LOWER resolution incorrectly returned child's type (that could also be `null`, type `NULL`), instead of KEYWORD/TEXT. So a test like `TO_UPPER(null) == "..."` fails on type mismatch. This was fixed collaterally by elastic#114334 (8.17.0) Also, correct some of the tests skipping (that had however no impact, due to testing range). (cherry picked from commit edb3818)
* ESQL: Rewrite TO_UPPER/TO_LOWER comparisons (#118870) This adds an optimization rule to rewrite TO_UPPER/TO_LOWER comparisons against a string into an InsensitiveEquals comparison. The rewrite can also result right away into a TRUE/FALSE, in case the string doesn't match the caseness of the function. This also allows later pushing down the predicate to lucene as a case-insensitive term-query. Fixes #118304. * Disable `TO_UPPER(null)`-tests prior to 8.17 (#119213) TO_UPPER/TO_LOWER resolution incorrectly returned child's type (that could also be `null`, type `NULL`), instead of KEYWORD/TEXT. So a test like `TO_UPPER(null) == "..."` fails on type mismatch. This was fixed collaterally by #114334 (8.17.0) Also, correct some of the tests skipping (that had however no impact, due to testing range). (cherry picked from commit edb3818)
TO_UPPER/TO_LOWER resolution incorrectly returned child's type (that could also be `null`, type `NULL`), instead of KEYWORD/TEXT. So a test like `TO_UPPER(null) == "..."` fails on type mismatch. This was fixed collaterally by elastic#114334 (8.17.0) Also, correct some of the tests skipping (that had however no impact, due to testing range).
Always return
KEYWORD
for functions that previously returnedTEXT
, because any change to the value, no matter how small, is enough to render meaningless the original analyzer associated with theTEXT
field value. In principle, if the attribute is no longer the originalFieldAttribute
, it can no longer claim to have the typeTEXT
.This has been done for all functions: conversion functions, aggregating functions, multi-value functions. There were several that already produced
KEYWORD
forTEXT
input (eg. ToString, FromBase64 and ToBase64, MvZip, ToLower, ToUpper, DateFormat, Concat, Left, Repeat, Replace, Right, Split, Substring), but many others that incorrectly claimed to produceTEXT
, while this was really a false claim. This PR makes that now strict, and includes changes to the functions' units tests to disallow the tests to expect any functions output to beTEXT
.One side effect of this change is that methods that take multiple parameters that require all of them to have the same type, will now treat TEXT and KEYWORD the same. This was already the case for functions like
Concat
, but is now also the case forGreatest
,Least
,Case
,Coalesce
andMvAppend
.An associated change is that the type casting operator
::text
has been entirely removed. It used to map onto theToString
function which returned type KEYWORD, and so::text
really produced aKEYWORD
, which is a lie, or at least abug
, which is now fixed. Should we ever wish to actually produce realTEXT
, we might love the fact that this operator has been freed up for future use (although it seems likely that function will require parameters to specify the analyzer, so might never be an operator again).Backwards compatibility issues:
This is a change that will fail BWC tests, since we have many tests that assert on TEXT output to functions. For this reason we needed to block two scenarios:
functions_never_emit_text
to prevent 7 csv-spec tests and 2 yaml tests from being run against older versions that still emit text.skipTest
to also block those two yaml tests from being run against the latest build, but using older yaml files downloaded (as far back as 8.14).In all cases the change observed in these tests was simply the results columns no longer having
text
type, and instead beingkeyword
.Kibana
Kibana makes use of some JSON files we produce during tests runs that explain the signatures of the functions, and the available type-casting operators. These are now edited to remove all
TEXT
output, and remove the::test
operator.Fixes
Fixes #114333
Fixes #111537