[RFC] High Level Approach and Design For Normalization and Score Combination #126
Comments
When the KNN query and the lexical search query have completely different sets of documents (no common docs), the kNN and BM25 scores are calculated for all the documents independently, normalized and combined for every document on the coordinator node, then sorted and rendered. Is my understanding right about how this feature will work?
@prasadnu So when you say all the documents, do you mean all the documents returned from the shards or all the documents present in the cluster?
@navneet1v Thanks for the response. Let me elaborate a little. Under this new feature, say I have a compound search query with 2 queries inside, one doing the kNN search and another doing the lexical search, which get triggered in parallel. When the final resultant documents from both queries are completely disjoint (d1,d2,d3,d4 for the kNN search and d5,d6,d7,d8 for the lexical search), then both the kNN and BM25 scores are calculated for all the resultant documents (d1,d2,d3,d4,d5,d6,d7,d8), normalized and combined at the global level, then sorted and rendered. Is this the desired outcome?
@prasadnu As per this proposal, the answer to this question is no. But this is a really interesting use-case. We could calculate both scores for all the documents (both lexical and k-NN), but my worry here is whether that would be optimal; in the worst case it could slow down the query. Please let me know if this kind of use case is likely to be prominent. All other parts of your understanding are correct.
@navneet1v If both scores are not calculated for all the resultant documents (the union of docs from the kNN and lexical searches), how can the scores be combined? Will the compound query only take the common list of documents (the docs returned by both searches) and combine their scores? If yes, then we cannot always expect both search queries to have docs in common; there may be scenarios where the neural search and the lexical search return disjoint sets of docs, and in those cases we cannot combine the scores when we don't have both scores calculated for each resulting doc. I have attached a screenshot of how hybrid ranking could possibly work for disjoint resultant docs from the different queries. Please clarify.
In addition to the above, if there are common docs in the results of both queries, they can always be boosted to the top (based on their summed score) at the coordinator level.
@prasadnu The reason behind this is that, for a reasonably large corpus/index, this is very unlikely to happen. Tagging @MilindShyani to provide more details.
In approximate kNN, the min value might not be known unless an exact kNN search is run, which might not be a good idea for large-scale kNN databases. I think for these cases there should be an option to provide a default minimum value (or maybe also a default max value) so the normalization can be done based on that. For example, if the metric is cosine similarity, we know that the min value is -1, so we can use that instead of an exact kNN search. Also, when there are documents that are not present in all queries, there should be a default score that would work as 0 when combining the scores. I also wonder how problems like this can be handled, since they come up in scenarios where approximate kNN is combined with other queries.
@SeyedAlirezaFatemi
Yes, I am aware of this; we never plan to use exact k-NN. I see that there is some confusion being created around this, so I will go ahead and update the proposal to make it clear that we are not going to use exact k-NN. To solve the problem, what we are proposing here is that once all the shards find the doc IDs and return them to the coordinator node, we will find the min k-NN score from those scores at the coordinator node level. This will be a good approximation of the min k-NN score.
This is a good idea for the min score; I will think it over and incorporate it as an alternative in the proposal.
For the default score: a document which is not present in one of the queries will get a score of 0 for that query. I thought this would be implicit because this is how query clauses like should work under bool, but I will go ahead and update the proposal to make it clearer. I hope this helps.
For this, I have added a section on future enhancements. In the first phase we are not planning to handle paginated queries.
I understand that pagination is not the priority here (though I think it's a crucial aspect of this feature that would enable it to be used in many products, since paging is present in many use cases). But if we use the min score of kNN that we got from the first top-k results, then if we increase this k (by going to the next pages or in some other scenario), that min score will change, which means the kNN scores after normalization will change, and that affects the ranking. Let's assume we use the usual min-max formula for normalizing the kNN scores, sketched below. I think this shows that having that default min score can be really important.
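A sketch of the normalization being discussed, assuming standard min-max scaling over the retrieved top-k kNN scores (so both min and max shift as k grows):

norm(score) = (score - min_score) / (max_score - min_score)

If min_score is recomputed from a larger result set when paging, the normalized values of the originally retrieved documents shift as well, which is the ranking instability described above.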
I think it is important to consider why normalization is used in the first place. Normalization is useful when combining results from different retrievers (for us, BM25 and kNN). If we are just using kNN, it doesn't matter what normalization we use (as long as it is a monotonic function). For instance, in the example above one doc has a score of 1 and nine docs have 0.9. After normalization, one doc has 1 and the nine docs have score 0. Note that the ranking has not changed, since 9 documents tied with score 0.9 is the same as 9 documents tied with score 0. The real problem comes when we want to combine: there, the score being 0.9 or 0 does lead to a difference. But during combination, we do not know the right way to decide the min score. What if the default min score is bigger than some actual score, in which case we could get negative scores? Many transformers use dot product for retrieval, where the score has no definite range; how do we select the min/max there? How do we decide the min/max scores for kNN and BM25: should they be the same or different? These are tricky questions and the answers depend on the dataset being used. What we do instead is appeal to a weak law of large numbers: if we have enough documents per shard and each shard has a diverse enough set of documents, the chances of hitting the cases you mentioned (which are indeed problematic for combination) are pretty slim. In other words, on average we expect the results to be fair. Hope that addresses some of your concerns. Please feel free to ask for clarification or leave comments.
Thanks for the response. Sorry if I wasn't clear with that example; I meant it in the context of using kNN with some text match query. As you mentioned, if there is only a single kNN query, normalization doesn't matter. But if a text query is also combined, there would be a problem.
You are right that having this default min score isn't the best option for all scenarios. But at least for BM25 and cosine similarity it does make sense (0 and -1 being the min, respectively). Maybe it would be nice to have the option to provide the min/max score for each query part, and if it is not defined in the query, the min/max would be calculated from the retrieved data as you proposed.
@SeyedAlirezaFatemi Thanks for providing the feedback. I will keep this in mind when developing the API interface for the new query type. I would recommend creating a feature request for this so that we can discuss the use-case and why it is needed in more detail.
@navneet1v We can of course just show the
Updated the description to add the science benchmarks link for reference: https://opensearch.org/blog/semantic-science-benchmarks/
@navneet1v Thanks for leading this effort! Much needed if we want to be able to blend lexical and semantic search results together in a meaningful way. You reference this keystone paper multiple times in this RFC, yet I don't see any mention of RRF (Reciprocal Rank Fusion) anywhere as a potential solution for blending results together, even though Elastic has demonstrated that it is a very viable solution that doesn't require messing with score normalization. Note that while "viable" doesn't mean "best", RRF is very simple to use and implement, which is also (very) important in terms of usability for end users. PS: Even searching org-wide for
Can you provide any insight into why RRF is not being considered at all in this context, even though OS is trying to solve the same issue that ES has already partly solved? Thanks much in advance!
Hi @consulthys, thanks for providing this info. Based on my reading and knowledge, RRF is supported in OpenSearch, similar to what is provided in the blog. This feature actually tries to tackle one more key idea, which is putting the scores of 2 different types of queries on the same scale by looking at the whole corpus of results from different shards; basically, providing support for score normalization. Once we have the scores normalized, there can be different ways to combine them. Right now we are enabling linear score combination, but I don't see a reason why this cannot be extended to support RRF combination. I am adding @MilindShyani to this thread to provide more details. Feel free to correct me if something is wrong.
@consulthys RRF is indeed a simple yet effective protocol with non-trivial benefits. I believe it is possible to execute rank fusion with most query DSLs out of the box. From what I gather, it should be easy to implement within OpenSearch. But as you noted, we should highlight this somewhere (perhaps in a tutorial/blog) so that users can start executing hybrid searches right away. However, given the diversity of search domains, there is no one-size-fits-all solution, and for many domains, as has been shown in the literature, score combination can perform better. Combination requires score normalization, which is what this RFC is about.
Thanks @navneet1v and @MilindShyani for chiming in and shedding some light on this matter. I know there's no one-size-fits-all solution and I'm not diminishing the benefits of this RFC, whose goal is perfectly clear (score normalization), but if RRF is supposedly already supported in OpenSearch yet there is not a single mention of it anywhere, I think that is a BIG omission that should be fixed as soon as possible. If any of you or anyone else could show an example of how RRF can be implemented (as of 2.9), I'm curious and it would be awesome. Thanks in advance.
@navneet1v @MilindShyani
Yes, this feature is planned for 2.10. The code is completed and merged. We are doing some testing and benchmarking as of now.
Can you show an example of how RRF can be implemented (as of 2.9), or refer me to someone who knows how to do it?
Of course! Let me try. Using a Python client (which unfortunately is not ideal), we could do something like the sketch below.
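A minimal sketch of client-side RRF, assuming the opensearch-py client; the index name, field names, query text, and query vector are placeholders, not anything taken from the thread:

```python
# Hypothetical client-side Reciprocal Rank Fusion over a BM25 query and a k-NN query.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def rrf_fuse(rankings, k=60):
    """RRF: score(doc) = sum over rankings of 1 / (k + rank of doc in that ranking)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Run the lexical and k-NN queries separately and keep the doc ids in rank order.
bm25_hits = client.search(index="my_index", body={
    "size": 10,
    "query": {"match": {"description": "warm winter jacket"}},
})["hits"]["hits"]

knn_hits = client.search(index="my_index", body={
    "size": 10,
    "query": {"knn": {"my_vector_field": {"vector": [0.1] * 768, "k": 10}}},
})["hits"]["hits"]

fused = rrf_fuse([[hit["_id"] for hit in bm25_hits],
                  [hit["_id"] for hit in knn_hits]])
print(fused[:10])
```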
Maybe @navneet1v could help us with how to do this using the query DSL?
Resolving this GitHub issue as the changes for the 2.10 RC are finalized and merged. Please create a GitHub issue if there are any further questions.
Thank you for the example you shared, but that's a client-side solution outside of OpenSearch, not within it. I thought you were referring to some magic trick I didn't know about that would allow you to do it in a script or something. Anyway, I'm eager to see OS 2.10 released soon (maybe next Monday, Sep 25th?) to try out hybrid search. Thanks for your hard work.
@consulthys Yes, the release of 2.10 is on Monday, September 25th.
Currently, if we need to use RRF, we need to do it outside of OpenSearch. But with this normalization feature we added some generic components, like the SearchPhaseResults processor, which are also going live in 2.10. I see 2 ways to do this:
If you are interested in that feature, please feel free to cut a GitHub issue or contribute. I believe we have everything available in the plugin to build the RRF technique.
Introduction
This issue discusses the various high-level directions being proposed for score combination and normalization techniques to improve Semantic Search/Neural Search queries in OpenSearch (META ISSUE: #123). The proposal tries to reuse existing OpenSearch extension points as much as possible. We also try to make sure that the directions we choose are long-term and provide different levels of customization for users to tweak Semantic Search to their needs. The document also proposes a phased design and implementation plan which will help us improve and add new features with every phase.
For information on how normalization improves the overall quality of results, please refer to this OpenSearch blog on science benchmarks: https://opensearch.org/blog/semantic-science-benchmarks/
Current Architecture
For simplicity, let us consider a 3-node OpenSearch cluster with 2 data nodes and 1 coordinator node. The data nodes store the data and the coordinator node coordinates the execution of the request. This is how OpenSearch works at a very high level.
Requirements
The customer here refers to OpenSearch customers who want to use OpenSearch to run Semantic Search in their applications.
Why can't the current requirements be fulfilled with the current OpenSearch architecture?
Currently, OpenSearch uses a query-then-fetch model to do the search. First, the whole query object is passed to all the shards to obtain the top results from each shard, and then the fetch phase begins, which fetches the relevant information for those documents. A typical Semantic Search use case will have 2 types of queries, a k-NN query and a text match query, which produce scores using different scoring methods.
So first, it is difficult to combine result sets whose scores are produced via different scoring methods. To effectively combine results, the scores from the different queries need to be put on the same scale (refer to https://arxiv.org/abs/2210.11934). By this we mean that a score needs to meet 2 requirements: (1) it indicates the relative relevance between its document and the other documents scored in the query, and (2) it is comparable with the relative relevance of results from other queries. For example, for k-NN the score range may be 0-1, while BM25 scores range from 0 to Float.MAX. Hence any existing combination query clause, like bool, will suffer from these problems.
Second, it is not possible to consider global hits for re-ranking. Because scores are assigned at the shard level, any rescoring will also be done at the shard level. Hence if we try to normalize the scores, the normalization will be local normalization and not global normalization.
Let's use the example below to understand the problem in more detail.
Example
Using the same cluster setup as defined above, let's look at an example to understand the 2 problems described above. For this example, let's assume we have an index whose schema looks like this:
The title and description fields are the product title and description. The tile_and_descrption_knn field is a k-NN vector field which holds a 768-dimension dense vector created using a dense vector model.
Query
We are using the bool query clause to combine the results of the k-NN query (the neural query converted to a k-NN query) and a text-based search query. The bool query's should clauses have their scores combined: the more matching clauses, the better.
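A hypothetical example of such a bool query, written as a Python dict for an OpenSearch client; the query text and model_id are placeholders, and the neural clause is only illustrative of the pattern described above:

```python
# Illustrative bool query combining a text match clause with a neural (k-NN) clause.
# The "should" clauses have their scores summed, each on its own scale.
bool_query = {
    "size": 10,
    "query": {
        "bool": {
            "should": [
                {"match": {"description": "warm winter jacket"}},
                {
                    "neural": {
                        "tile_and_descrption_knn": {
                            "query_text": "warm winter jacket",
                            "model_id": "<model-id>",
                            "k": 10,
                        }
                    }
                },
            ]
        }
    },
}
```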
Score Computation Examples at Different Levels
Combination using the query provided above: as we can see, because the k-NN and BM25 scores are on different scales, when one of the queries behaves badly for a document (for example, document d8), that document can still come back as the first result because of its BM25 score. The standard boolean combination does not take advantage of relative ranking. Documents like d7 stay lower in the results even though they have good scores in both BM25 and k-NN. This problem becomes more pronounced because BM25 scores are unbounded.
The way to solve this problem of combining scores from queries that operate on different scales is to first normalize the scores of both queries and then combine them. You can read more about this here (https://arxiv.org/abs/2210.11934).
Using normalization: to see how normalization works, let's look at the table below. In it we have done 2 types of normalization, one done for every shard/data node and another at the coordinator node (global normalization).
Note: I did the normalization using only the scores present here; hence some documents have scores of 0 after normalization.
Final Sorted Documents to be returned based on above examples:
If we keep our focus on document d8, we can see how its position changes without normalization, with local normalization, and with global normalization. Because local normalization only considers the scores at the per-shard level, it can suffer when one of the scores is the lowest on its shard. But with global normalization (i.e., normalization at the coordinator node level), we look at scores from the whole corpus, which smooths out this problem because worse scores may come from other shards. We ran experiments to verify this.
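A small sketch with made-up scores showing how per-shard (local) min-max normalization can differ from coordinator-level (global) normalization:

```python
# Made-up BM25 scores from two shards, to contrast local vs. global min-max normalization.
shard_scores = {
    "shard_1": {"d7": 12.0, "d8": 2.0},
    "shard_2": {"d3": 9.0, "d4": 1.0},
}

def min_max(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

# Local normalization: each shard uses only its own min and max.
local = {shard: min_max(scores) for shard, scores in shard_scores.items()}

# Global normalization: the coordinator uses the min and max across all shards.
all_scores = {doc: s for scores in shard_scores.values() for doc, s in scores.items()}
global_norm = min_max(all_scores)

print(local)        # d8 -> 0.0 on its shard, even though d4 is globally worse
print(global_norm)  # d8 -> ~0.09, d4 -> 0.0; ranks now reflect the whole corpus
```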
System Requirements
Functional Requirements:
Good to Have but not as P0:
High Level Directions for Building the Solution
If we look at the requirements, we see that we need solutions at different levels of the whole _search API flow. We can divide the whole flow into 3 parts:
Defining _search API Input
The proposal for the input is to use the _search API and define a new compound query clause. This new compound query clause will hold an array of queries which will be executed in parallel on each data node. The name of the new query clause is not yet defined. The interface of this new query clause is inspired by the dis_max query clause, but dis_max runs its queries sequentially. The new query clause will make sure that scores are calculated at the shard level independently for each sub-query. The sub-query rewriting will be done at the coordinator level to avoid duplicate computation.
Note: The interfaces defined here are not finalized. The interfaces will be refined as part of the LLD GitHub proposals, but we first want to make sure that we align on the high-level approach.
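As a rough sketch only (the clause name compound_queries and the field layout are placeholders invented for illustration, since the real name and interface will come from the LLD proposals), the request body might look something like this:

```python
# Hypothetical shape of the proposed compound query clause: an array of sub-queries
# that are executed in parallel and scored independently at the shard level.
compound_search_request = {
    "size": 10,
    "query": {
        "compound_queries": {
            "queries": [
                {"match": {"description": "warm winter jacket"}},
                {
                    "neural": {
                        "tile_and_descrption_knn": {
                            "query_text": "warm winter jacket",
                            "model_id": "<model-id>",
                            "k": 10,
                        }
                    }
                },
            ]
        }
    },
}
```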
Pros:
Cons:
Alternatives Considered
Alternative-1: Implement a new REST handler instead of creating a new compound query
The idea here is to create a new REST handler which defines the list of queries whose scores need to be normalized and combined.
Pros:
Cons:
Obtaining Relevant Information for Normalization and Score Combination
This section talks about how OpenSearch will get the relevant information required for doing the normalization. For example, let's say the customer has defined min-max normalization; then for every sub-query we will need the min and max score over the whole corpus.
During the query phase, OpenSearch uses a QueryPhaseSearcher class to run the query and collect the documents at the shard level via the TopDocsCollector interface. There is no extension point in QueryPhaseSearcher to plug in a different TopDocsCollector implementation. The only extension available is that a plugin can define a new QueryPhaseSearcher implementation. So we will define a new QueryPhaseSearcher implementation which will use a new TopDocsCollector implementation at the shard level to gather the relevant information for doing normalization.
Pros:
Cons:
Alternatives Considered
Alternative-1: Enhance the DFS query search type, or create a new query search type, to get the information needed for normalization
The default search type in OpenSearch is query-then-fetch, which first queries the results and then fetches the actual source from the shards. DFS query-then-fetch is another search type which the customer can set via a query param. With it, OpenSearch will first pre-query each shard asking for term and document frequencies, and then send this information to all the shards, where scores are calculated using these global term/document frequencies.
We can build something similar, where we do a pre-query to find the min and max scores from all the shards and then pass this information back so that each shard can normalize the scores for each sub-query.
Pros:
Cons:
Normalizing and Combining Scores using Search Pipeline
The idea here is to extend the Search Pipeline request and response transformers to create another type of transformer which is called after the query phase is completed. We will use this transformer interface to do the normalization and score combination for the document IDs returned from the query phase, as per the user's input. The transformed result will then be passed on to the fetch phase, which will run as-is.
Below is the modified version of the API input proposed above. It adds the relevant fields for normalization and score combination.
Note: The interfaces defined here are not finalized. The interfaces will be refined as part of the LLD GitHub proposals, but we want to make sure that we align on the high-level approach.
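As a rough sketch only (all names, technique values, and the weights field below are placeholders, not the final interface), the earlier compound query extended with normalization and combination settings might look like:

```python
# Hypothetical compound query with added normalization and score-combination settings
# that the post-query-phase transformer would read.
compound_search_request = {
    "size": 10,
    "query": {
        "compound_queries": {
            "queries": [
                {"match": {"description": "warm winter jacket"}},
                {
                    "neural": {
                        "tile_and_descrption_knn": {
                            "query_text": "warm winter jacket",
                            "model_id": "<model-id>",
                            "k": 10,
                        }
                    }
                },
            ],
            # How each sub-query's scores are rescaled at the coordinator level.
            "normalization": {"technique": "min_max"},
            # How the normalized scores are merged into one score per document.
            "combination": {"technique": "arithmetic_mean", "weights": [0.3, 0.7]},
        }
    },
}
```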
Alternatives Considered
Alternative-1: Create a new phase in between query and fetch phase
The high-level idea here is to create a phase that runs between the query and fetch phases and performs the normalization.
Pros:
Cons:
Alternative-2: Create a fetch sub-phase which does the normalization and score combination
OpenSearch provides an extension point where plugins can add fetch sub-phases which run at the end, after all core sub-phases are executed. We could create a fetch sub-phase that does the normalization and score combination. The problem with this is that, because we have multiple sub-queries, we would need to change the interfaces to make sure all the information required for normalization is passed in. This would result in duplicated information and multiple hacks to pass it through the earlier fetch phases.
Pros:
Cons:
Alternative-3: Use SearchOperationListeners interface
SearchOperationListener runs at the shard level and not at the coordinator node level (please check this and this code reference). Hence we cannot use SearchOperationListeners, as we need the normalization to be done at the coordinator node level.
High Level Diagram
Based on the above 3 directions, below is the high-level flow diagram. The 2 sub-queries are provided as an example. There will be a limit on how many sub-queries a customer can define in the query clause; the maximum we will allow is 10. (There is no specific reason for this particular number; we just want a limit in place to avoid long-running queries leading to circuit breaker and cluster failures.)
**There can be many sub-queries in the new compound query.**
Future Scope or Enhancements which will be picked up during next phases
Implementing Pagination in new Compound Query Clause
The proposed design doesn't support paginated queries in the first phase, to reduce the scope of the phase-1 launch. We also have not yet done a deep-dive on how this could be implemented or what the current pagination solution looks like.
Enhance the explain query functionality for new compound Query Clause
With the phase-1 implementation of the new query, we won't be providing explain functionality for the query clause. The explain API provides information about why a specific document matches (or doesn't match) a query. It is a very useful API for customers to understand results and debug.
Enabling Parallel/Concurrent Search on Segments functionality for new Compound Query
The idea here is to enable parallel search for all the different queries provided in the compound query, to improve its performance. Parallel search on segments is already in the sandbox for OpenSearch (https://github.com/opensearch-project/OpenSearch/tree/main/sandbox/plugins/concurrent-search).
Implement Script based Combination to allow customers to use any Score combination Techniques
The initial proposal gives customers only a fixed set of functions to combine the scores. With script-based combination, customers can define custom scripts to combine the scores before we rank them.
Integrating the compound query with tools like Querqy to provide a better experience for customers
The idea here is that the new compound query clause can become overwhelming, so we want to integrate it with query-writing helpers like Querqy to make it easier for customers to write queries.
Launch Plan
Below is a high-level phased approach for building the feature. These phases are not set in stone and may change as we make progress on the implementation.
Phase-1
Given that we are defining a new compound query clause for OpenSearch, we will launch the features defined in this document and the high-level design under a feature flag. High-level items:
Phase-2
The phase-2 will focus on these items:
Phase-3
The phase-3 will focus on these items:
Phase-4
By this time we will have a good understanding of how customers are using this new compound query. This phase will start to focus on how we can make it easier for customers to start using the new query clause. The item below helps us do that:
FAQ:
Do other products support combining scores at the global level?
Yes, Elasticsearch supports this feature, but it only combines the results of a k-NN query with a text match query; it is not generic. Also, ES doesn't support normalizing scores globally. Reference: https://www.elastic.co/guide/en/elasticsearch/reference/master/knn-search.html#_combine_approximate_knn_with_other_features.
As per the documentation, the limitation is:
Approximate kNN search always uses the dfs_query_then_fetch search type in order to gather the global top k matches across shards. You cannot set the search_type explicitly when running kNN search.
Example Query:
Open Questions
Question 1: Should we consider the case where sub-queries are normalized using different techniques, or where a customer only wants to normalize one query and not the others?
After discussion, this is a valid use case, but we don't have enough data to prove the hypothesis; it will really depend on the customer's use case. I would suggest we start by doing normalization on all the sub-queries, as the blog above shows we should, especially since this is not a one-way door.
Question 2: How are we going to calculate the min score for the queries globally?
We will have a min score from each shard, but as of now we don't have a way to find the global min score for the k-NN queries; to get it we would need to run exact k-NN. I am still trying to find a way to do this. For text matching, since we iterate over all the segments, we will have the min score. But more deep-dive is required on the feasibility of the solution.
Next Steps:
Appendix
What is Normalization?
Normalization is a data transformation process that aligns data values to a common scale or distribution of values.
Normalization requires that you know or are able to accurately estimate the minimum and maximum observable values. You may be able to estimate these values from your available data.
What are the different ways to do normalization?
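As a sketch, two commonly used techniques are min-max and z-score normalization (this is illustrative, not an exhaustive list of what the proposal will support):

```python
# Two common score-normalization techniques.
import statistics

def min_max_normalize(scores):
    """Rescale scores into [0, 1] using the observed min and max."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def z_score_normalize(scores):
    """Center scores on the mean and scale by the standard deviation."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)
    return [(s - mean) / stdev for s in scores]

print(min_max_normalize([2.0, 5.0, 9.0]))  # [0.0, ~0.43, 1.0]
print(z_score_normalize([2.0, 5.0, 9.0]))
```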
Score Combination Techniques
Now you have 2 or more result sets and you need to combine them. There are many ways to combine the results (geometric mean, arithmetic mean, etc.).
Approach 1: Normalized arithmetic mean
Assume we have 2 sets of results, results_a and results_b. Each result has a score and a document ID. First, we will only consider the intersection of results in a and b (i.e. results_c = results_a ∩ results_b). Then, each document in results_c will have 2 scores: one from a and one from b. To combine the scores, we will first normalize all scores in results_a and results_b, and then take the arithmetic mean of them:
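In symbols, with norm() standing for whichever normalization technique is applied:

score = (norm(score_a) + norm(score_b)) / 2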
Approach 2: Normalized geometric mean
Similar to Approach 1, but instead of taking the arithmetic mean, we will take the geometric mean:
score = sqrt(norm(score_a) * norm(score_b))
Approach 3: Normalized harmonic mean
Similar to Approach 1, but instead of taking the arithmetic mean, we will take the harmonic mean:
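For two scores, the harmonic mean is:

score = 2 * norm(score_a) * norm(score_b) / (norm(score_a) + norm(score_b))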
Approach 4: Normalized Weighted Linear Combination
Instead of taking the mean of the scores, we can just try different weights for each score and combine them linearly.
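A common form, with weights w_a and w_b chosen per use case (for example, summing to 1), is shown below:

score = w_a * norm(score_a) + w_b * norm(score_b)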
Approach 5: Normalized Weighted Geometric Combination
Similar to the above approach, but instead of combining with addition, we combine with multiplication:
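One common form uses the per-query weights as exponents:

score = norm(score_a)^w_a * norm(score_b)^w_b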
This approach has previously been recommended for score combination with OpenSearch/Elasticsearch: elastic/elasticsearch#17116 (comment).
Approach 6: Interleave results
In this approach, we produce the ranking by interleaving the results from each set. So ranks 1, 3, 5, ... would come from results_a and ranks 2, 4, 6, ... would come from results_b.
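A minimal sketch of interleaving two ranked lists of doc IDs, assuming a document is kept only the first time it appears:

```python
# Interleave two ranked result lists, dropping duplicates on their second appearance.
from itertools import chain, zip_longest

def interleave(results_a, results_b):
    merged, seen = [], set()
    for doc in chain.from_iterable(zip_longest(results_a, results_b)):
        if doc is not None and doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged

print(interleave(["d1", "d3", "d5"], ["d2", "d3", "d6"]))  # ['d1', 'd2', 'd3', 'd5', 'd6']
```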
Reference Links: