Postings based completion suggester #33160
Labels
>feature
:Search Relevance/Suggesters
"Did you mean" and suggestions as you type
Team:Search Relevance
Meta label for the Search Relevance team in Elasticsearch
Context
The
completion
suggester is a good fit for applications that have a strong a-priori weighting for each suggestion.It is optimized for speed to provide a search experience based on what the user has already typed in. However searching for suggestions cannot be mixed with filters on other fields. Users can configure
context
s to add some filtering on top of this suggester but this functionality is limited and does not provide the flexibility of a real index. Since suggestions (and contexts) are compiled in an FST at flush time the memory required by this suggester at query time depends on the size of the resulting FST. When indexing a lot of suggestions the memory needed to load theses FSTs (one per segment) can limit the available memory in a node (FST are loaded in the Java heap).Because of these restrictions we often advise users to create a custom suggester based on the inverted lists and
edge_ngram
. Using the inverted lists reduces the memory but it can impact speed. Thecompletion
suggester is able to "early terminate" a search sinceit can collect suggestions in the order of their weights, hence it much faster than a regular search that uses the inverted lists.
Proposal
Lucene 8 introduces new optimizations to search top documents sorted by score more efficiently when the total hit count is not required.
By knowing the max possible score of a block (or a segment) of documents, scorers are now able to skip non-competitive hits efficiently.
Benchmarks showed some impressive speed up for scored queries: for instance
term
.Search as you type requires speed since each keystroke generates a query so the idea here is to provide an out of the box field mapper that is configured for completion suggestions.
The score of each suggestion can be materialized in the index by replacing the frequency of terms with the a-priori weighting of the suggestions (set by the user at index time). Since we're looking for a fast search-as-you-type experience, we can index all prefixes of the suggestions in order to ensure that a single inverted list is used at query time. Using the inverted list instead of a custom FST brings
two advantages:
Mapping
The current
completion
suggester is prefix-based. We could provide the same functionality by using anedge_ngram
filter under the hood like theindex_prefixes
option does on thetext
field:However this mode would limit the scope of this suggester so we could also provide a confiuration option to index suggestions with infixes.
So instead of all prefixes we'd tokenize the input and index the prefixes of each term. To improve precision we could also index shingles (size 2 or 3?) and computes the
edge_ngram
of each shingle:Indexing
Indexing documents with suggestions would be equivalent to the current
completion
suggester:Each term (prefix or infix) is indexed with the provided weight which can be used at query time to rank the suggestions.
Querying
This field mapper uses the same structure than any other
text
field so any query that work withtext
field would be compatible. Moreover any other field can be used to filter the documents that the suggest field would match:match_phrase_prefix
could be used to search for suggestions but we could also introduce a new query: "match_prefix" that would remove the phrasing search and would only consider the last term as a prefix. This query would help to simplify the writing of suggestion query sinceprefix
queries are not analyzed.Scoring
The scoring at query time is more flexible than the current
completion
suggester thanks to thefeature
field.Each suggestion is indexed with an associated weight that is used at query time for scoring but another factor can be used to modify these static scores. For example if each document is indexed with a
popularity
feature field:The optimization for fast top docs retrieval would still apply because the
feature
query can skip block of non-competitive hits.Miscellaneous
In order to validate this proposal we'd need to perform some tests to ensure that this new field mapper can handle search-as-you-type efficiently.
A benchmark against the current
completion
suggester is needed to compare the performance. Even though we expect this new suggester to be slower than the FST-based, the new optimizations in Lucene 8 should make it fast enough to be a good candidate for a search-as-you-type experience. he other downside of this approach is the size on disk that is required to index all prefixes (infixes) so we also need to compare (and evaluate) the requirements and performance of the indexation. Since these questions could impact the development of this feature we'll start with a small POC to test these assumptions.The text was updated successfully, but these errors were encountered: