MatchNoDocsQuery from stop words with wildcards in query_string #28856

Closed
rezecib opened this issue Feb 28, 2018 · 1 comment · Fixed by #28871
Labels: >bug, :Search Relevance/Analysis, Team:Search Relevance

rezecib commented Feb 28, 2018

Elasticsearch version (bin/elasticsearch --version): 6.2.2, Build: 10b1edd/2018-02-16T19:01:30.685723Z

Plugins installed: ["analysis-icu"]

JVM version (java -version): 9.0.4

OS version (uname -a if on a Unix-like system): Mac OS Sierra 10.12.6
16.7.0 Darwin Kernel Version 16.7.0: Thu Jan 11 22:59:40 PST 2018; root:xnu-3789.73.8~1/RELEASE_X86_64 x86_64

Description of the problem including expected versus actual behavior:
When running a query_string query with the "stop" analyzer and wildcard analysis enabled, a query that contains both stop words and wildcards (on the stop words themselves or on other query terms) is handled incorrectly. The expected behavior (at least, the behavior in Elasticsearch 5.5.0) is for the stop word to be removed from the token stream; the actual behavior is that it is converted to a MatchNoDocsQuery.

Steps to reproduce:

  1. Start an elasticsearch process: bin/elasticsearch
  2. Create a simple index:
curl -X PUT "localhost:9200/stop-wildcard-test" -H "Content-Type: application/json" -d '{
	"mappings": {
		"doc": {
			"properties": {
				"content": { "type": "text" }
			}
		}
	}
}'
  3. Request validation for "on the run*":
curl -X POST "localhost:9200/stop-wildcard-test/_validate/query?explain&pretty" -H "Content-Type: application/json" -d '{
  "query": {
      "query_string": {
        "query": "on the run*",
        "analyzer": "stop",
        "analyze_wildcard": true
      }
  }
}'

Response:

{
  "valid" : true,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "explanations" : [
    {
      "index" : "stop-wildcard-test",
      "valid" : true,
      "explanation" : "MatchNoDocsQuery(\"analysis was empty for content:on\") content:run*"
    }
  ]
}
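
To spot the symptom in a saved response without reading it by eye, a small check along these lines may help. This is a minimal sketch: the function name `explanation_is_clean` is introduced here for illustration, and it only looks for the literal text `MatchNoDocsQuery` on `"explanation"` lines.

```shell
#!/bin/sh
# Hypothetical helper: reads a _validate/query response on stdin and
# succeeds only when no "explanation" line mentions MatchNoDocsQuery.
explanation_is_clean() {
  ! grep '"explanation"' | grep -q 'MatchNoDocsQuery'
}
```

For example, `explanation_is_clean < response.json` (where `response.json` is a saved response) returns a non-zero status for the 6.2.2 response shown above.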

I also tested the following queries:

  • on the run produces the expected content:run, without MatchNoDocsQuery
  • on* the run produces MatchNoDocsQuery("analysis was empty for content:on") content:run

Also tested with wildcard analysis off:

  • on the run* produces MatchNoDocsQuery("analysis was empty for content:on") content:run*
  • on the run produces the expected content:run
  • on* the run produces content:on* content:run (unlike with wildcard analysis on)
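
The combinations listed above can be replayed in one pass. The following is a minimal sketch, assuming a node listening on localhost:9200 and the stop-wildcard-test index from the reproduction steps. `ES_URL`, `build_body`, `validate` and `run_all` are names introduced here for illustration, and "wildcard analysis off" is assumed to mean `"analyze_wildcard": false`.

```shell
#!/bin/sh
# Assumed defaults; override ES_URL to point at another node.
ES_URL="${ES_URL:-localhost:9200}"
INDEX="stop-wildcard-test"

# Print the request body for one query string ($1) and one
# analyze_wildcard setting ($2, either true or false).
build_body() {
  printf '{"query":{"query_string":{"query":"%s","analyzer":"stop","analyze_wildcard":%s}}}' "$1" "$2"
}

# Send one variant to the validate API and print the raw response.
validate() {
  curl -s -X POST "$ES_URL/$INDEX/_validate/query?explain&pretty" \
    -H "Content-Type: application/json" \
    -d "$(build_body "$1" "$2")"
}

# Run every query with wildcard analysis on, then off.
# Only call this while a node is reachable at ES_URL.
run_all() {
  for flag in true false; do
    for query in "on the run" "on* the run" "on the run*"; do
      printf '== query=%s analyze_wildcard=%s\n' "$query" "$flag"
      validate "$query" "$flag"
      printf '\n'
    done
  done
}
```

Calling `run_all` after sourcing the script prints one response per combination, which makes it easier to compare the explanations across versions.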

For comparison, here is the response to the same query from a fresh installation of 5.5.0 (with analysis-icu installed for parity, though it is probably not relevant here):

{
    "valid": true,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "explanations": [
        {
            "index": "stop-wildcard-test",
            "valid": true,
            "explanation": "_all:run"
        }
    ]
}

Provide logs (if relevant):
Not relevant.

javanna added the :Search/Search label Mar 1, 2018
javanna (Member) commented Mar 1, 2018

cc @elastic/es-search-aggs

javanna added the :Search Relevance/Analysis label and removed the :Search/Search label Mar 1, 2018
jimczi self-assigned this Mar 1, 2018
jimczi added the >bug label Mar 1, 2018
jimczi added a commit to jimczi/elasticsearch that referenced this issue Mar 1, 2018
This change ensures that we ignore terms removed by the analysis rather than returning a match_no_docs query for the part
of the query that contains the stop word. For instance, a query like "the AND fox" should ignore "the" if it is a stop word instead of
adding a match_no_docs query.
This change also fixes the analysis of prefix terms that start with a stop word (e.g. `the*`). Previously, if `analyze_wildcard` was true and `the`
was a stop word, this part of the query was rewritten into a match_no_docs query. Because it is a prefix query, this change forces the prefix query
on `the` even if it is removed by the analysis.

Fixes elastic#28855
Fixes elastic#28856
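
The two behaviors described in this commit message can be illustrated with a toy model. This is only a sketch and not Elasticsearch's parser: `STOP_WORDS` is a hypothetical subset of a stop list, `is_stop_word` and `rewrite_term` are names introduced here, the field is assumed to be `content`, `analyze_wildcard` is assumed to be true, and the model looks at one term at a time, so it ignores how the real parser groups adjacent terms.

```shell
#!/bin/sh
# Toy model only. STOP_WORDS is a hypothetical subset of a stop list.
STOP_WORDS="a an and on the"

# Succeed when $1 is one of the words in STOP_WORDS.
is_stop_word() {
  case " $STOP_WORDS " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# rewrite_term MODE TERM
# MODE is "before" or "after" the change described above.
rewrite_term() {
  mode="$1"
  term="$2"
  case "$term" in
    *\*)
      # Prefix term: strip the trailing wildcard before the stop-word check.
      base="${term%\*}"
      if is_stop_word "$base" && [ "$mode" = before ]; then
        printf 'MatchNoDocsQuery("analysis was empty for content:%s")' "$base"
      else
        # After the change the prefix query is kept even for a stop word.
        printf 'content:%s*' "$base"
      fi
      ;;
    *)
      if is_stop_word "$term"; then
        # Before: match_no_docs. After: the term is ignored, so print nothing.
        if [ "$mode" = before ]; then
          printf 'MatchNoDocsQuery("analysis was empty for content:%s")' "$term"
        fi
      else
        printf 'content:%s' "$term"
      fi
      ;;
  esac
}
```

In this model, `rewrite_term after 'the*'` keeps the prefix query and `rewrite_term after the` yields nothing, while the `before` mode returns a MatchNoDocsQuery string in both cases.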
jimczi added a commit that referenced this issue Mar 4, 2018
sebasjm pushed a commit to sebasjm/elasticsearch that referenced this issue Mar 10, 2018
javanna added the Team:Search Relevance label Jul 16, 2024