[BUG] Wildcard query behavior for text fields changed in a backward-incompatible way #8711

kartg · 2023-07-16T01:10:05Z

Describe the bug
It appears that PR #5462 changed the behavior of wildcard queries against text fields. Queries that previously worked (due to the behavior documented in #5461) no longer return any results. IMO, this is a backwards-incompatible change that should not have been backported to 2.x.

To Reproduce
This can be repro'd using the OpenSearch docker images. First, start a node container:

docker run -d -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:2.5.0

Then, create a test index with a test field of type text and index a document into it:

curl -X PUT -H 'Content-Type: application/json' 'localhost:9200/test_index?pretty' -d'
{
  "settings": { 
    "number_of_shards": 1, 
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "test_field": { "type": "text" }
    }
  }
}'

curl -X POST "localhost:9200/test_index/_doc?pretty" -H 'Content-Type: application/json' -d'
{
  "test_field": "1A2B3C"
}'

Now perform a wildcard search and observe that no hits are returned:

curl -H 'Content-Type: application/json' localhost:9200/test_index/_search --data '{
"query": {
    "wildcard": {
      "test_field": {
        "value": "1A*"
    }
  }
 }
}'

Next, repeat the same steps with the Docker image for the previous minor release, opensearchproject/opensearch:2.4.1, and observe that the wildcard query returns a result.

Expected behavior
Unclear. The PR seems to fix erroneous behavior but introduces a breaking change.

Plugins
N/A

Screenshots
N/A

Host/Environment:

OS: macOS 12.6.6
Version: OpenSearch 2.5.0

Additional context
A workaround for this bug is to use a query_string query instead:

curl -H 'Content-Type: application/json' localhost:9200/test_index/_search --data '{
"query": {
    "query_string": {
      "default_field": "test_field",
      "query": "1A*"
    }
  }
}'

Finally, this bug only affects analyzed/normalized field types. AFAIK, that means only keyword and text fields. However, the PR updates KeywordFieldMapper to override this value, since keyword fields are always normalized, so keyword field types are not affected by this change.


macohen · 2023-07-17T20:35:20Z

@nknize can you take a look at this please? It looks related to #5462

dblock · 2023-07-17T20:48:41Z

@kartg looks broken :( if you have time, write a YAML REST test for this next

nknize · 2023-07-17T20:51:02Z

This is expected behavior. The default analyzer lowercases the input string at index time. So running a query_string wildcard query with an uppercase will not (and should not) yield any matches. The prior behavior was a "false positive" bug and fixed in #5462. If you want exact wildcard matches, the user should use a different "analyzer" that doesn't normalize the input strings.

That patch did yield another change that also provides the behavior you seek along w/ a yaml test.
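
One way to see what each analyzer actually emits is the _analyze API. The following is a minimal sketch, not part of the original comment: analyze_body is a hypothetical helper, and OPENSEARCH_URL is an assumed variable that you set yourself (for example to http://localhost:9200 for the container from the repro steps). The standard analyzer should report a lowercased token, while the built-in whitespace analyzer should keep the original case.

```shell
#!/bin/sh
# Hypothetical helper: prints an _analyze request body for a built-in analyzer.
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}\n' "$1" "$2"
}

for analyzer in standard whitespace; do
  body=$(analyze_body "$analyzer" "1A2B3C")
  echo "$analyzer: $body"
  # Assumption: OPENSEARCH_URL is set by you; the request is skipped otherwise.
  if [ -n "${OPENSEARCH_URL:-}" ]; then
    curl -s -H 'Content-Type: application/json' \
      "$OPENSEARCH_URL/_analyze?pretty" --data "$body"
  fi
done
```

Comparing the two responses shows whether a wildcard pattern containing uppercase characters has any chance of matching the stored tokens.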

dblock · 2023-07-17T20:56:22Z

Thanks @nknize, sounds like I'm wrong on it being a regression, should we close this?

nknize · 2023-07-17T21:51:58Z

> Thanks @nknize, ...should we close this?

Yes. I'm marking as a non-issue and closing. @kartg feel free to re-open for additional discussion if you have specific downstream breaking functionality.

kartg · 2023-07-18T18:47:16Z

> The prior behavior was a "false positive" bug and fixed in #5462. If you want exact wildcard matches, the user should use a different "analyzer" that doesn't normalize the input strings.

@nknize reopening because i'd like to discuss if such a behavior change breaks semver.

As I outlined in the repro steps above, a user query on a text field which previously returned hits no longer does so after a minor version update. Asking the user to be aware that the standard analyzer lower-cases the stored string thus their search queries must now also be in lower case (or they must now supply .caseInsensitive(true)) is a breaking, backwards-incompatible change IMO. There's also override logic for keyword field types so that such values continue to match in a case-insensitive manner, so this behavior is inconsistent between analyzed types.

> So running a query_string wildcard query with an uppercase will not (and should not) yield any matches.

Note that the use-case i'm describing in this issue is for the wildcard query. The query_string logic was updated in the same PR to normalize the search query text so such a query will in fact yield matches and is the basis for the workaround i've documented.
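
The difference between the two query types can be mimicked without a cluster. This is only a rough sketch under stated assumptions: normalization is modeled as plain lowercasing, shell glob matching stands in for the real wildcard term match, and normalize and glob_match are hypothetical helper names rather than OpenSearch APIs.

```shell
#!/bin/sh
# Stand-in for analysis-time normalization (assumption: lowercasing only).
normalize() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

# Stand-in for a wildcard term match: glob-match token $1 against pattern $2.
glob_match() {
  case "$1" in
    $2) echo "match" ;;
    *) echo "no match" ;;
  esac
}

indexed=$(normalize "1A2B3C")   # what the text field is assumed to store
pattern='1A*'                   # what the user typed

# wildcard-style: the pattern is compared as typed
echo "wildcard-style: $(glob_match "$indexed" "$pattern")"

# query_string-style: the pattern is normalized before comparing
echo "query_string-style: $(glob_match "$indexed" "$(normalize "$pattern")")"
```

With these assumptions, the first comparison fails and the second succeeds, which mirrors the behavior described in the repro steps and the workaround.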

nknize · 2023-07-18T22:37:10Z

> i'd like to discuss if such a behavior change breaks semver.

It doesn't, because this is a bug that was fixed. We shouldn't set a precedent of retaining bug compatibility; that's a door we should never open.

> Asking the user to be aware that the standard analyzer lower-cases the stored string thus their search queries must now also be in lower case (or they must now supply .caseInsensitive(true)) is a breaking, backwards-incompatible change IMO.

Users of Elasticsearch and OpenSearch should already be aware that what they index is not the raw string and what tokens are created by the Standard Analyzer and Standard Tokenizer. If they don't they'll have no idea what's in their index. This isn't breaking because capital letters should never have matched in the first place - that's why case_insensitive exists at all (not to mention that parameter didn't work for the same reason the bug existed, thus it was fixed).

> Note that the use-case i'm describing in this issue is for the wildcard query.

So the user should set case_insensitive to true and it should match. If that's not happening, then that bug should be fixed, rather than opening the door to bwc bug compatibility.
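
As a sketch of that suggestion, the wildcard query from the repro steps can carry the case_insensitive parameter. The helper name wildcard_ci_body and the OPENSEARCH_URL variable are assumptions made for illustration, and whether the request returns a hit on a given version has not been verified here.

```shell
#!/bin/sh
# Hypothetical helper: prints a wildcard query body with case_insensitive enabled.
wildcard_ci_body() {
  printf '{"query":{"wildcard":{"%s":{"value":"%s","case_insensitive":true}}}}\n' "$1" "$2"
}

body=$(wildcard_ci_body test_field '1A*')
echo "$body"

# Assumption: OPENSEARCH_URL is set by you (for example http://localhost:9200);
# the search is only sent when it is defined.
if [ -n "${OPENSEARCH_URL:-}" ]; then
  curl -s -H 'Content-Type: application/json' \
    "$OPENSEARCH_URL/test_index/_search?pretty" --data "$body"
fi
```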

kartg · 2023-07-19T01:44:35Z

> We shouldn't set a precedent of retaining bug compatibility

I wholeheartedly agree, which is why my answer to the "expected behavior" section starts with "unclear" 😄

> Users of Elasticsearch and OpenSearch should already be aware that what they index is not the raw string and what tokens are created by the Standard Analyzer and Standard Tokenizer. If they don't they'll have no idea what's in their index.

This assumes that the users/clients querying for data are the same as (or at least in lock step with) the users/clients indexing data. I don't necessarily agree with this assumption, but that's a separate concern. In the interest of consistent behavior, why do we want wildcard queries on keyword fields and query_string queries to behave differently? Shouldn't these queries also accurately reflect the contents of the index?

PS - There's an xkcd comic for this because....of course 🤣

macohen · 2023-07-26T18:01:30Z

@nknize, in which version was the original bug you fixed introduced? I think we should document that the bug was introduced in some version and fixed in 2.5, and that users may have to change their implementation if they assumed the buggy behavior was expected.

cc: @kolchfa-aws

dblock · 2023-07-27T16:08:09Z

+1, this should go into a "breaking changes" section of docs

macohen · 2023-08-15T17:10:30Z

Created a doc issue: opensearch-project/documentation-website#4788

kartg added the bug and untriaged labels Jul 16, 2023

kartg changed the title from "[BUG] Wildcard queries are broken for text fields" to "[BUG] Wildcard query behavior for text fields changed in a backward-incompatible way" Jul 16, 2023

macohen removed the untriaged label Jul 17, 2023

macohen assigned nknize Jul 17, 2023

macohen added the Search label Jul 17, 2023

github-project-automation bot added this to Search Project Board Jul 17, 2023

github-project-automation bot moved this to 🆕 New in Search Project Board Jul 17, 2023

nknize closed this as completed Jul 17, 2023

github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Jul 17, 2023

nknize added the non-issue label Jul 17, 2023

kartg reopened this Jul 18, 2023

github-project-automation bot moved this from ✅ Done to 🏗 In progress in Search Project Board Jul 18, 2023

github-actions bot added the untriaged label Jul 18, 2023

macohen removed the untriaged label Jul 19, 2023

macohen mentioned this issue Aug 15, 2023

[DOC] Wildcard Query Bug Fix Created a Breaking Change opensearch-project/documentation-website#4788

Closed

4 tasks

mingshl closed this as completed Aug 17, 2023

github-project-automation bot moved this from 🏗 In progress to ✅ Done in Search Project Board Aug 17, 2023

macohen mentioned this issue Aug 28, 2023

added missing documentation about case_insensitive param in wildcard … opensearch-project/documentation-website#4581

Closed

1 task
