-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make Fuzziness reject illegal values earlier #33511
Conversation
Pinging @elastic/es-search-aggs |
The current java implementation of Fuzziness leaves a lot of room for initializing it with illegal values that either cause errors later when the queries reach the shards where they are executed or values that are silently ignored in favour of defaults. We should instead tighten the java implementation of the class so that we only accept supported values. Currently those are numeric values representing the edit distances 0, 1 and 2, optionally also as float or string, and the "AUTO" fuzziness, which can come in a cusomizable variant that allows specifying two value that define the positions in a term where the AUTO option increases the allowed edit distance. This change removes several redundant ways of object construction and adds input validation to the remaining ones. Java users should either use one of the predefined constants or use the static factory methods `fromEdits(int)` or `fromString(String)` to create instances of the class, while other ctors are hidden. This allows for instance control, e.g. returning one of the constants when creating instances from an integer value. Previously the class would accept any positive integer value and any float value, while in effect the maximum allowed edit distance was capped at 2 in practice. These values while throw an error now, as will any other String value other than "AUTO" that where previously accepted but led to numeric exceptions when the query was executed.
5dd7e7c
to
14bfd51
Compare
@elasticmachine test this please |
* Note: Using this method only makes sense if the field you are applying Fuzziness to is some sort of string. | ||
* @throws IllegalArgumentException if the edit distance is not in [0, 1, 2] | ||
*/ | ||
public static Fuzziness fromEdits(int edits) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method used to be very permissive, e.g. parsing float values and silently casting them to ints, where values larger than 2 get ignored somewhere later in the code I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cbuescher , it looks good. I left one question
} else if (upperCase.startsWith("AUTO:")) { | ||
return parseCustomAuto(upperCase); | ||
} else { | ||
// should be a float or int representing a valid edit distance, otherwise throw error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we restrict to int ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I see this still leaves the door open for something like "fuzziness" : 1.5
to be silently cast to int 1. At the same time I think it would be good to continue to allow floats equivalent to the allowed edit distances 0,1,2, e.g. allow "1.0" or "1.0000" but reject "1.1". This is because I remember other issues where users going through rest sometimes had issues that their json-strigification produces float-like values even if they feed them ints (I think this was mostly in javascript clients). Anyway, would you be okay to allow "0.0", "1.0" and "2.0" so anything evenly divisible still?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@jimczi I added a commit that disallows float values with 0 decimals (like "1.00", "2.0" etc...) for the three allowed values but rejects other floats. Let me know if that sounds okay to you. |
@elasticmachine run the gradle build tests 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed the updates on this pr. There are some conflicts that need to be resolved but the logic seems good to me.
} else if (upperCase.startsWith("AUTO:")) { | ||
return parseCustomAuto(upperCase); | ||
} else { | ||
// should be a float or int representing a valid edit distance, otherwise throw error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@jimczi thanks for the review, I just resolved the merge conflict, will go over the PR again since its a bit old to check whether the docs need some updating anywhere and add a migration note since this changes the Java API slightly. Currently also only targeting 8 with this cleanup. |
Sure, thanks for reviving this and +1 to target 8 only. |
* elastic/master: (36 commits) Remove unneded cluster config from test (elastic#40856) Make Fuzziness reject illegal values earlier (elastic#33511) Remove test-only customisation from TransReplAct (elastic#40863) Fix dense/sparse vector limit documentation (elastic#40852) Make -try xlint warning disabled by default. (elastic#40833) Async Snapshot Repository Deletes (elastic#40144) Revert "Replace usages RandomizedTestingTask with built-in Gradle Test (elastic#40564)" Init global checkpoint after copy commit in peer recovery (elastic#40823) Replace usages RandomizedTestingTask with built-in Gradle Test (elastic#40564) [DOCS] Removed redundant (not quite right) information about upgrades. Remove string usages of old transport settings (elastic#40818) Rollup ignores time_zone on date histogram (elastic#40844) HLRC: fix uri encode bug when url path starts with '/' (elastic#34436) Mutes GatewayIndexStateIT.testRecoverBrokenIndexMetadata Docs: Pin two IDs in the rest client (elastic#40785) Adds version 6.7.2 [DOCS] Remind users to include @ symbol when applying license file (elastic#40688) HLRC: Convert xpack methods to client side objects (elastic#40705) Allow ILM to stop if indices have nonexistent policies (elastic#40820) Add an OpenID Connect authentication realm (elastic#40674) ...
* master: (63 commits) Suppress lease background sync failures if stopping (elastic#40902) [DOCS] Added settings page for ILM. (elastic#40880) [Docs] Remove extraneous text (elastic#40914) Move test classes to test root in Painless (elastic#40873) Fix date index name processor default date_formats (elastic#40915) Source additional files correctly in elasticsearch-cli (elastic#40890) Allow AVX-512 on JDK 11+ (elastic#40828) [Docs] Change example to show col headers (elastic#40822) Update apache httpclient to version 4.5.8 (elastic#40875) Update monitoring-kibana.json (elastic#40899) Introduce Delegating ActionListener Wrappers (elastic#40129) Deprecate old transport settings (elastic#40821) Add Kibana application privileges for monitoring and ml reserved roles (elastic#40651) Use Writeable for TransportReplAction derivatives (elastic#40894) Add test for HTTP and Transport TLS on basic license (elastic#40714) Remove unneded cluster config from test (elastic#40856) Make Fuzziness reject illegal values earlier (elastic#33511) Remove test-only customisation from TransReplAct (elastic#40863) Fix dense/sparse vector limit documentation (elastic#40852) Make -try xlint warning disabled by default. (elastic#40833) ...
* master: (77 commits) Suppress lease background sync failures if stopping (elastic#40902) [DOCS] Added settings page for ILM. (elastic#40880) [Docs] Remove extraneous text (elastic#40914) Move test classes to test root in Painless (elastic#40873) Fix date index name processor default date_formats (elastic#40915) Source additional files correctly in elasticsearch-cli (elastic#40890) Allow AVX-512 on JDK 11+ (elastic#40828) [Docs] Change example to show col headers (elastic#40822) Update apache httpclient to version 4.5.8 (elastic#40875) Update monitoring-kibana.json (elastic#40899) Introduce Delegating ActionListener Wrappers (elastic#40129) Deprecate old transport settings (elastic#40821) Add Kibana application privileges for monitoring and ml reserved roles (elastic#40651) Use Writeable for TransportReplAction derivatives (elastic#40894) Add test for HTTP and Transport TLS on basic license (elastic#40714) Remove unneded cluster config from test (elastic#40856) Make Fuzziness reject illegal values earlier (elastic#33511) Remove test-only customisation from TransReplAct (elastic#40863) Fix dense/sparse vector limit documentation (elastic#40852) Make -try xlint warning disabled by default. (elastic#40833) ...
The current java implementation of Fuzziness leaves a lot of room for initializing it with illegal values that either cause errors later when the queries reach the shards where they are executed or values that are silently ignored in favour of defaults. We should instead tighten the java implementation of the class so that we only accept supported values. Currently those are numeric values representing the edit distances 0, 1 and 2, optionally also as float or string, and the "AUTO" fuzziness, which can come in a cusomizable variant that allows specifying two value that define the positions in a term where the AUTO option increases the allowed edit distance. This change removes several redundant ways of object construction and adds input validation to the remaining ones. Java users should either use one of the predefined constants or use the static factory methods `fromEdits(int)` or `fromString(String)` to create instances of the class, while other ctors are hidden. This allows for instance control, e.g. returning one of the constants when creating instances from an integer value. Previously the class would accept any positive integer value and any float value, while in effect the maximum allowed edit distance was capped at 2 in practice. These values while throw an error now, as will any other String value other than "AUTO" that where previously accepted but led to numeric exceptions when the query was executed.
The current java implementation of Fuzziness leaves a lot of room for
initializing it with illegal values that either cause errors later when the
queries reach the shards where they are executed or values that are silently
ignored in favour of defaults. We should instead tighten the java implementation
of the class so that we only accept supported values. Currently those are
numeric values representing the edit distances 0, 1 and 2, optionally also as
float or string, and the "AUTO" fuzziness, which can come in a cusomizable
variant that allows specifying two value that define the positions in a term
where the AUTO option increases the allowed edit distance.
This change removes several redundant ways of object construction and adds input
validation to the remaining ones. Java users should either use one of the
predefined constants or use the static factory methods
fromEdits(int)
orfromString(String)
to create instances of the class, while other ctors arehidden. This allows for instance control, e.g. returning one of the constants
when creating instances from an integer value.
Previously the class would accept any positive integer value and any float
value, while in effect the maximum allowed edit distance was capped at 2 in
practice. These values while throw an error now, as will any other String value
other than "AUTO" that where previously accepted but led to numeric exceptions
when the query was executed.