Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postings based completion suggester #33160

Closed
jimczi opened this issue Aug 27, 2018 · 2 comments
Closed

Postings based completion suggester #33160

jimczi opened this issue Aug 27, 2018 · 2 comments
Labels
>feature :Search Relevance/Suggesters "Did you mean" and suggestions as you type Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@jimczi
Copy link
Contributor

jimczi commented Aug 27, 2018

Context

The completion suggester is a good fit for applications that have a strong a-priori weighting for each suggestion.
It is optimized for speed to provide a search experience based on what the user has already typed in. However searching for suggestions cannot be mixed with filters on other fields. Users can configure contexts to add some filtering on top of this suggester but this functionality is limited and does not provide the flexibility of a real index. Since suggestions (and contexts) are compiled in an FST at flush time the memory required by this suggester at query time depends on the size of the resulting FST. When indexing a lot of suggestions the memory needed to load theses FSTs (one per segment) can limit the available memory in a node (FST are loaded in the Java heap).
Because of these restrictions we often advise users to create a custom suggester based on the inverted lists and edge_ngram. Using the inverted lists reduces the memory but it can impact speed. The completion suggester is able to "early terminate" a search since
it can collect suggestions in the order of their weights, hence it much faster than a regular search that uses the inverted lists.

Proposal

Lucene 8 introduces new optimizations to search top documents sorted by score more efficiently when the total hit count is not required.
By knowing the max possible score of a block (or a segment) of documents, scorers are now able to skip non-competitive hits efficiently.
Benchmarks showed some impressive speed up for scored queries: for instance
term.
Search as you type requires speed since each keystroke generates a query so the idea here is to provide an out of the box field mapper that is configured for completion suggestions.
The score of each suggestion can be materialized in the index by replacing the frequency of terms with the a-priori weighting of the suggestions (set by the user at index time). Since we're looking for a fast search-as-you-type experience, we can index all prefixes of the suggestions in order to ensure that a single inverted list is used at query time. Using the inverted list instead of a custom FST brings
two advantages:

  • Other fields can be used to filter suggestions.
  • The memory used by this suggester is controlled by the file system cache.

Mapping

The current completion suggester is prefix-based. We could provide the same functionality by using an edge_ngram filter under the hood like the index_prefixes option does on the text field:

{
  "my_suggester": {
    "type": "completion_postings" # Name to be defined ;)
    "prefix": {}
  }
}

However this mode would limit the scope of this suggester so we could also provide a confiuration option to index suggestions with infixes.
So instead of all prefixes we'd tokenize the input and index the prefixes of each term. To improve precision we could also index shingles (size 2 or 3?) and computes the edge_ngram of each shingle:

{
  "my_suggester": {
    "type": "completion_postings" # Name to be defined ;)
    "infix": {
      "shingle_size": 2
    }
  }
}

Indexing

Indexing documents with suggestions would be equivalent to the current completion suggester:

{
  "my_suggester": {
    "input": "It's a wonderful world",
    "weight": 42
  }
}

Each term (prefix or infix) is indexed with the provided weight which can be used at query time to rank the suggestions.

Querying

This field mapper uses the same structure than any other text field so any query that work with text field would be compatible. Moreover any other field can be used to filter the documents that the suggest field would match:

{
  "bool": {
    "must": {
      "match_phrase_prefix": {
        "my_suggester": "It's a "
      }
    },
    "filter": {
      "match": {
        "category": "music"
      }
    }
}

match_phrase_prefix could be used to search for suggestions but we could also introduce a new query: "match_prefix" that would remove the phrasing search and would only consider the last term as a prefix. This query would help to simplify the writing of suggestion query since prefix queries are not analyzed.

Scoring

The scoring at query time is more flexible than the current completion suggester thanks to the feature field.
Each suggestion is indexed with an associated weight that is used at query time for scoring but another factor can be used to modify these static scores. For example if each document is indexed with a popularity feature field:

"bool": {
  "must": {
    "match_phrase_prefix": {
      "my_suggester": "It's a ",
      "boost": 0.1
    }
  },
  "should": {
    "feature": {
      "field": "popularity"
    }
  }
}

The optimization for fast top docs retrieval would still apply because the feature query can skip block of non-competitive hits.

Miscellaneous

In order to validate this proposal we'd need to perform some tests to ensure that this new field mapper can handle search-as-you-type efficiently.
A benchmark against the current completion suggester is needed to compare the performance. Even though we expect this new suggester to be slower than the FST-based, the new optimizations in Lucene 8 should make it fast enough to be a good candidate for a search-as-you-type experience. he other downside of this approach is the size on disk that is required to index all prefixes (infixes) so we also need to compare (and evaluate) the requirements and performance of the indexation. Since these questions could impact the development of this feature we'll start with a small POC to test these assumptions.

@jimczi jimczi added >feature :Search Relevance/Suggesters "Did you mean" and suggestions as you type labels Aug 27, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@jimczi
Copy link
Contributor Author

jimczi commented Apr 3, 2019

This feature is implemented in #35600, hence closing.

@jimczi jimczi closed this as completed Apr 3, 2019
@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search Relevance/Suggesters "Did you mean" and suggestions as you type Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch
Projects
None yet
Development

No branches or pull requests

3 participants