Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] WIldcard query behavior for text fields changed in an backward-incompatible way #8711

Closed
kartg opened this issue Jul 16, 2023 · 11 comments
Assignees
Labels
bug Something isn't working non-issue bugs / unexpected behaviors that end up non issues; audit trail simple changes that aren't issues Search Search query, autocomplete ...etc

Comments

@kartg
Copy link
Member

kartg commented Jul 16, 2023

Describe the bug
It appears that PR #5462 changed the behavior for wildcard queries against text fields. Queries that previously worked (due to the behavior documented in #5461) now no longer return any results. IMO, this is backwards-incompatible change that should not have been backported to 2.x

To Reproduce
This can be repro'd using the OpenSearch docker images. First, start a node container:

docker run -d -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:2.5.0

Then, create a test index with a test field of type text and add data to this:

curl -X PUT -H 'Content-Type: application/json' 'localhost:9200/test_index?pretty' -d'
{
  "settings": { 
    "number_of_shards": 1, 
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "test_field": { "type": "text" }
    }
  }
}'

curl -X POST "localhost:9200/test_index/_doc?pretty" -H 'Content-Type: application/json' -d'
{
  "test_field": "1A2B3C"
}'

Now perform a wildcard search and observe that no hits are returned:

curl -H 'Content-Type: application/json' localhost:9200/test_index/_search --data '{
"query": {
    "wildcard": {
      "test_field": {
        "value": "1A*"
    }
  }
 }
}'

Next, repeat the same steps with the previous minor release Docker image - opensearchproject/opensearch:2.4.1 - and observe that the wildcard query returns a result

Expected behavior
Unclear. The PR seems to fix erroneous behavior but introduces a breaking change.

Plugins
N/A

Screenshots
N/A

Host/Environment (please complete the following information):

  • OS: [e.g. iOS] MacOS 12.6.6
  • Version [e.g. 22] OpenSearch 2.5.0

Additional context
A workaround for this bug is to use a query_string query instead:

curl -H 'Content-Type: application/json' localhost:9200/test_index/_search --data '{
"query": {
    "query_string": {
      "default_field": "test_field",
      "query": "1A*"
    }
  }

Finally, this bug only affects analyzed/normalized fields types. AFAIK, this means only keyword and text are affected. The PR updates KeywordFieldMapper to override this value since keyword fields are always normalized so keyword field types are not affected by this change.

@kartg kartg added bug Something isn't working untriaged labels Jul 16, 2023
@kartg kartg changed the title [BUG] WIldcard queries are broken for text fields [BUG] WIldcard query behavior for text fields changed in an backward-incompatible way Jul 16, 2023
@macohen
Copy link
Contributor

macohen commented Jul 17, 2023

@nknize can you take a look at this please? It looks related to #5462

@macohen macohen removed the untriaged label Jul 17, 2023
@macohen macohen added the Search Search query, autocomplete ...etc label Jul 17, 2023
@dblock
Copy link
Member

dblock commented Jul 17, 2023

@kartg looks broken :( if you have time, write a YAML REST test for this next

@nknize
Copy link
Collaborator

nknize commented Jul 17, 2023

This is expected behavior. The default tokenizer first lower cases the input string at index time. So running a query_string wildcard query with an uppercase will not (and should not) yield any matches. The prior behavior was a "false positive" bug and fixed in #5462. If you want exact wildcard matches, the user should use a different "analyzer" that doesn't normalize the input strings.

That patch did yield another change that also provides the behavior you seek along w/ a yaml test.

@dblock
Copy link
Member

dblock commented Jul 17, 2023

Thanks @nknize, sounds like I'm wrong on it being a regression, should we close this?

@nknize
Copy link
Collaborator

nknize commented Jul 17, 2023

Thanks @nknize, ...should we close this?

Yes. I'm marking as a non-issue and closing. @kartg feel free to re-open for additional discussion if you have specific downstream breaking functionality.

@nknize nknize closed this as completed Jul 17, 2023
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Jul 17, 2023
@nknize nknize added the non-issue bugs / unexpected behaviors that end up non issues; audit trail simple changes that aren't issues label Jul 17, 2023
@kartg
Copy link
Member Author

kartg commented Jul 18, 2023

The prior behavior was a "false positive" bug and fixed in #5462. If you want exact wildcard matches, the user should use a different "analyzer" that doesn't normalize the input strings.

@nknize reopening because i'd like to discuss if such a behavior change breaks semver.

As I outlined in the repro steps above, a user query on a text field which previously returned hits no longer does so after a minor version update. Asking the user to be aware that the standard analyzer lower-cases the stored string thus their search queries must now also be in lower case (or they must now supply .caseInsensitive(true)) is a breaking, backwards-incompatible change IMO. There's also override logic for keyword field types so that such values continue to match in a case-insensitive manner, so this behavior is inconsistent between analyzed types.

So running a query_string wildcard query with an uppercase will not (and should not) yield any matches.

Note that the use-case i'm describing in this issue is for the wildcard query. The query_string logic was updated in the same PR to normalize the search query text so such a query will in fact yield matches and is the basis for the workaround i've documented.

@kartg kartg reopened this Jul 18, 2023
@github-project-automation github-project-automation bot moved this from ✅ Done to 🏗 In progress in Search Project Board Jul 18, 2023
@nknize
Copy link
Collaborator

nknize commented Jul 18, 2023

i'd like to discuss if such a behavior change breaks semver.

It doesn't because this is a bug that was fixed. We shouldn't set a precedent of retaining bug compatibility, thats a slippery slope door we should never open.

Asking the user to be aware that the standard analyzer lower-cases the stored string thus their search queries must now also be in lower case (or they must now supply .caseInsensitive(true)) is a breaking, backwards-incompatible change IMO.

Users of Elasticsearch and OpenSearch should already be aware that what they index is not the raw string and what tokens are created by the Standard Analyzer and Standard Tokenizer. If they don't they'll have no idea what's in their index. This isn't breaking because capital letters should never have matched in the first place - that's why case_insensitive exists at all (not to mention that parameter didn't work for the same reason the bug existed, thus it was fixed).

Note that the use-case i'm describing in this issue is for the wildcard query.

So the user should set case_insensitive to true and it should match. If that's not happening then that bug should be fixed - not opening bwc bug compatibility.

@kartg
Copy link
Member Author

kartg commented Jul 19, 2023

We shouldn't set a precedent of retaining bug compatibility

I whole heartedly agree, which is why my answer to the "expected behavior" section starts with "unclear" 😄

Users of Elasticsearch and OpenSearch should already be aware that what they index is not the raw string and what tokens are created by the Standard Analyzer and Standard Tokenizer. If they don't they'll have no idea what's in their index.

This assumes that the users/clients querying for data are the same (or at least in lock step) as the users/clients indexing data. I don't necessarily agree with this assumption, but that's a separate concern. In the interest of consistent behavior, why do we want wildcard queries on keyword field and query_string queries to behave differently? Shouldn't these queries also accurately reflect the contents of the index?


PS - There's an xkcd comic for this because....of course 🤣

@macohen macohen removed the untriaged label Jul 19, 2023
@macohen
Copy link
Contributor

macohen commented Jul 26, 2023

What version was the original bug you fixed, @nknize, introduced? I think we should document that the bug was introduced in some version, fixed in 2.5, but users may have to change their implementation if the assumption was that the bug was expected behavior.

cc: @kolchfa-aws

@dblock
Copy link
Member

dblock commented Jul 27, 2023

+1, this should go into a "breaking changes" section of docs

@macohen
Copy link
Contributor

macohen commented Aug 15, 2023

Created a doc issue: opensearch-project/documentation-website#4788

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-issue bugs / unexpected behaviors that end up non issues; audit trail simple changes that aren't issues Search Search query, autocomplete ...etc
Projects
Archived in project
Development

No branches or pull requests

5 participants