[BUG] Uppercase Regex in String queries search #3578

yyyogev · 2022-06-14T10:40:38Z

Regex search on analyzed fields doesn’t work with capital letters now as if the analyzer saves them lower cased, or doesn't use the standard analyzer for some reason.

To Reproduce
Steps to reproduce the behavior:

Create an index and add a doc containing uppercase letters to it (e.g message: this is a TLS handshake)
Perform this search request:
{"query": {"bool": {"must": [{"query_string": {"query": "message:/TLS/"}}]}}}

Expected behavior
We should find the doc we added above since it's an exact match. However, we got 0 results.

Plugins
None

Host/Environment (please complete the following information):
I used OpenSearch container with a single node (version 1.2.3), running on iOS 11.6.4

Additional context
If we specify the standard analyzer to the search request, we get the expected results.
{"query": {"bool": {"must": [{"query_string": {"query": "message:/TLS/", "analyzer":"standard"}}]}}}

The text was updated successfully, but these errors were encountered:

reta · 2022-06-14T12:44:17Z

@yyyogev could you please share the mapping for message field?

yyyogev · 2022-06-14T12:45:48Z

I didn't not specify it, so whatever the default is

reta · 2022-06-14T13:32:27Z

@yyyogev OK, in this the assumption is that you use standard analyzer [1] for the message field as well, which includes Lower Case Token Filter. As such, the behavior your are observing is expected. There are three options to align search with the ingestion:

use "analyzer":"standard" (as you did)
use lowercased search queries {"query": "message:/tls/"}
use regexp query and case_insensitive search [2]:

 {"query": {"bool": {"must": [{"regexp": {"message": { "value": "TLS", "case_insensitive": true }}}]}}}

[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-standard-analyzer.html
[2] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/query-dsl-regexp-query.html

yyyogev · 2022-06-14T13:44:09Z

This behavior was different at ES 6.8, is there documentation for this change of behavior anywhere?

reta · 2022-06-14T15:13:43Z

I am not aware on any changes to be fair on search side, perhaps there are differences on ingestion side (mappings).

yyyogev · 2022-06-15T12:57:51Z

Doesn't seem so, this is the mapping for the doc in ES 6.8:
{ "_doc": { "properties": { "message": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } } }

and this is at OS 1.2.3:
{ "properties": { "message": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } }

reta · 2022-06-15T19:53:11Z

@yyyogev I was really curious to find out why, here is the thing: 6.8.x used default search analyzer (== standard) whereas it has changed by [1] to only use analyzer if it is specified at search (query) time.

[1] https://github.com/elastic/elasticsearch/pull/61013/files

yyyogev · 2022-06-15T21:23:41Z

And if not then it doesn't use any analyzers at all?
The thing is, it's very weird behavior, that the actual string won't be matched and the lowercase of it will..

yyyogev · 2022-06-20T15:10:47Z

@reta is there a way to use some analyzer (standard or a custom one) without specifying it in each and every request?

reta · 2022-06-20T16:46:27Z

And if not then it doesn't use any analyzers at all?

For regex specifically - yes

The thing is, it's very weird behavior, that the actual string won't be matched and the lowercase of it will..

You could use the case_insensitive in your query to overcome that

@reta is there a way to use some analyzer (standard or a custom one) without specifying it in each and every request?

To be honest, I was not able to find the other way around by just looking into code, could you please try the other options listed here [1]?

[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/specify-analyzer.html#specify-search-analyzer

yyyogev · 2022-06-21T09:43:04Z

@reta case_insensitive is not one of query_string params

reta · 2022-06-21T12:19:14Z

@reta case_insensitive is not one of query_string params

Correct, if using regexp as specialized query is not an option - disregard it please, at worst - you could try to lowercase query before sending it to Opensearch.

yyyogev · 2022-06-21T13:14:28Z

This creates another problem, if we have lowercasing the query, it breaks the search for terms of type keyword since these might be saved with uppercase letters

reta · 2022-06-21T13:45:51Z

@yyyogev I see, we have limited number of choices than (at this moment), specifying analyzer seems to be the simplest one

yyyogev · 2022-06-21T16:55:26Z

@reta we tried this already and it doesn't work as well.. I specified the search analyzer for the whole index with these settings:

  "dodo": {
    "settings": {
      "index": {
        "number_of_shards": "1",
        "provided_name": "dodo",
        "creation_date": "1655830029877",
        "analysis": {
          "analyzer": {
            "default_search": {
              "type": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "fFU7oGbEQwSJVHn7DFjoXQ",
        "version": {
          "created": "136217827"
        }
      }
    }
  }
}

reta · 2022-06-21T17:12:00Z

@yyyogev I meant the approach you mentioned in the issue, analyzer at query time, sorry for confusion:

If we specify the standard analyzer to the search request, we get the expected results.
{"query": {"bool": {"must": [{"query_string": {"query": "message:/TLS/", "analyzer":"standard"}}]}}}

yyyogev · 2022-06-21T17:17:01Z

@reta this solution has the same problem.. it would lowercase keywords won't find keywords with uppercase letters

reta · 2022-06-21T18:10:37Z

@yyyogev sadly I do not have universal solution for you right now, there is Apache Lucene issue / pull request to support case-insensitivity for regexes in query_string (exactly your case) [1], fe /TLS/i but it seems to be abandoned. I assume you use whatever query string is provided, if you really have to preserve regex behavior, you would need to preprocess the query to lower-case the regexes (and keep everything else intact).

On the Opensearch side, what we could do, is to consult search_analyzer for regex queries, basically reverting back this change since Opensearch does not support wildcard field type:

setAnalyzer(forceAnalyzer == null ? queryBuilder.context.getSearchAnalyzer(currentFieldType) : forceAnalyzer);

@nknize what do you think?

[1] apache/lucene-solr#1708

nknize · 2022-06-27T16:30:28Z

On the OpenSearch side we should add case_insensitive support like we did for RegexpIntervalsSourceProvider.

In the meantime I think in this situation you'll want to use the PatternAnalyzer at query time. In the pattern analyzer the Lowercase TokenFilter is used to auto lowercase your query string so it matches the default lowercased tokens created at index time by the Standard analyzer. If you want to preserve case for exact matches you could use an index template (like you did above) to define a custom analyzer to use at index time that removes the LowerCase token filter. Then use PatternAnalyzer at query time w/ lowercase set to false. The affect would be nothing is lowercased, so query strings would have to be exact match at query time (sounds like this is not what you want).

reta · 2022-06-27T18:14:57Z

@nknize sorry, just to reiterate one thing, for regex queries (specifically query_string scenario), the per-field query time analyzer is ignored, it could be forced but for all fields.

alexgnatyuk · 2022-07-04T11:55:14Z

@reta @nknize we believe that reverting the change (aligning the behavior with the way it worked before the fix) should be the way to go in this case because:

the initial fix was needed only for fields with the type of wildcard, which OpenSearch doesn't support. There is no real reason for this code to exist in the OpenSearch.
the fix changes make the regex search case sensitive in a very weird way since we don't automatically apply type-specific analyzers at a search time for regex anymore, although we do it on indexing time by default.
the fix introduces an inability to apply type-specific analyzers during the query string regex searches (even if set on an index or type mapping level), which reduces the opportunities to affect the query parsing and search flow in general.

please let me know what you think about it.

and we of course happy to contribute to it if that's the way to go here.

@AmiStrn FYI

AmiStrn · 2022-07-05T16:08:48Z

@reta I agree with @alexgnatyuk here. this seems like a really easy fix if the code introduced in Elasticsearch 7.9 is reverted since it is an x-pack feature that is using it.
It is a bug, really, that has gone unnoticed for a while.
users should not have to be explicit about this for the regex's in their query since many times they will be using OpenSearch-dashboards and not have that level of knowledge that they need to also specify the analyzer type somehow (that would be really bad UX).

the discussion is about the proposed solution - what do you think? should we make this change as proposed?

reta · 2022-07-05T16:49:31Z

I am totally with you @AmiStrn @alexgnatyuk (#3578 (comment)), need @nknize confirmation this is away to go

AmiStrn · 2022-07-10T14:19:12Z

@nknize can you please update here? green light for us to start working on these changes?

yyyogev added bug Something isn't working untriaged labels Jun 14, 2022

kotwanikunal added Indexing & Search and removed untriaged labels Jun 21, 2022

yyyogev mentioned this issue Jul 21, 2022

Set analyzer to regex query string search #3967

Merged

5 tasks

nknize closed this as completed in #3967 Aug 15, 2022

nknize mentioned this issue Oct 18, 2022

[CI] test failure: search/190_index_prefix_search/search with uppercase regex} #4831

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Uppercase Regex in String queries search #3578

[BUG] Uppercase Regex in String queries search #3578

yyyogev commented Jun 14, 2022 •

edited

Loading

reta commented Jun 14, 2022

yyyogev commented Jun 14, 2022 •

edited

Loading

reta commented Jun 14, 2022

yyyogev commented Jun 14, 2022

reta commented Jun 14, 2022

yyyogev commented Jun 15, 2022

reta commented Jun 15, 2022

yyyogev commented Jun 15, 2022 •

edited

Loading

yyyogev commented Jun 20, 2022

reta commented Jun 20, 2022

yyyogev commented Jun 21, 2022

reta commented Jun 21, 2022

yyyogev commented Jun 21, 2022 •

edited

Loading

reta commented Jun 21, 2022

yyyogev commented Jun 21, 2022 •

edited

Loading

reta commented Jun 21, 2022 •

edited

Loading

yyyogev commented Jun 21, 2022

reta commented Jun 21, 2022

nknize commented Jun 27, 2022 •

edited

Loading

reta commented Jun 27, 2022

alexgnatyuk commented Jul 4, 2022 •

edited

Loading

AmiStrn commented Jul 5, 2022 •

edited

Loading

reta commented Jul 5, 2022

AmiStrn commented Jul 10, 2022 •

edited

Loading

[BUG] Uppercase Regex in String queries search #3578

[BUG] Uppercase Regex in String queries search #3578

Comments

yyyogev commented Jun 14, 2022 • edited Loading

reta commented Jun 14, 2022

yyyogev commented Jun 14, 2022 • edited Loading

reta commented Jun 14, 2022

yyyogev commented Jun 14, 2022

reta commented Jun 14, 2022

yyyogev commented Jun 15, 2022

reta commented Jun 15, 2022

yyyogev commented Jun 15, 2022 • edited Loading

yyyogev commented Jun 20, 2022

reta commented Jun 20, 2022

yyyogev commented Jun 21, 2022

reta commented Jun 21, 2022

yyyogev commented Jun 21, 2022 • edited Loading

reta commented Jun 21, 2022

yyyogev commented Jun 21, 2022 • edited Loading

reta commented Jun 21, 2022 • edited Loading

yyyogev commented Jun 21, 2022

reta commented Jun 21, 2022

nknize commented Jun 27, 2022 • edited Loading

reta commented Jun 27, 2022

alexgnatyuk commented Jul 4, 2022 • edited Loading

AmiStrn commented Jul 5, 2022 • edited Loading

reta commented Jul 5, 2022

AmiStrn commented Jul 10, 2022 • edited Loading

yyyogev commented Jun 14, 2022 •

edited

Loading

yyyogev commented Jun 14, 2022 •

edited

Loading

yyyogev commented Jun 15, 2022 •

edited

Loading

yyyogev commented Jun 21, 2022 •

edited

Loading

yyyogev commented Jun 21, 2022 •

edited

Loading

reta commented Jun 21, 2022 •

edited

Loading

nknize commented Jun 27, 2022 •

edited

Loading

alexgnatyuk commented Jul 4, 2022 •

edited

Loading

AmiStrn commented Jul 5, 2022 •

edited

Loading

AmiStrn commented Jul 10, 2022 •

edited

Loading