[BUG] Unexpected Ranking Behavior in Hybrid Query with Min-Max Normalization and Arithmetic Mean Combination #910
Comments
Thanks @rohantilva. We will move this to the Neural Search repository.
Hi @rohantilva, thanks for creating this issue. Can you share some details on the shape of the search request? Based on the example in the issue description, B should not be missing. To investigate further, I need to reproduce the issue on my end, so it would be great if you could share the steps to reproduce it.
Thanks for the feedback @rohantilva. I'd like to add one more ask to the previous request: can you please also share the exact requests used to obtain the raw scores mentioned in the header:
@vibrantvarun @martin-gaievski Thanks for jumping on this. Some details below/attached.
Hybrid query request: this is an example of a similar-looking request (I trimmed some of the extraneous fields to remove sensitive information). It's a hybrid query executing 2 queries (keyword match + knn semantic search), where the weight of the first query is set to 1 and the weight of the second query is set to 0 (note: I set these weights intentionally to illustrate the bug in full effect). Also note: I've removed the actual embeddings from the request (hence
Screenshots of results: there are three "sections" in the screenshot, which show a couple of things:
Btw, I am using AWS managed OpenSearch, version 2.15. I know there could be some drift between that and the open-source version 2.15, so I just wanted to point that out.
Hey @rohantilva, I will look into this today.
Hey @rohantilva, I just verified that the hybrid query works as expected. I have some concerns about the screenshot you shared. The subqueries, when run individually, yield 5 results, and the hybrid query result count is 8, given that 2 duplicate documents are part of the result set from both subqueries. However, I can clearly see that those duplicate results are not even present in the hybrid query result set. I think some other issue is affecting this. How come documents starting with
are part of the hybrid result when neither of the subqueries returns them?
Also, I just wanted to confirm that you ran each subquery individually, in a standalone manner, not under the hybrid clause.
The individual subqueries yielded 5 results because I explicitly requested 5 documents for the subqueries (again, executed individually) and >5 documents for the hybrid query. If I request only 5 documents for the hybrid query, I get the same results, excluding the last 3 pictured in the screenshot I sent.
What URL are you hitting for the hybrid query? Can you share it?
@vibrantvarun we're using this API: https://opensearch.org/docs/latest/api-reference/multi-search/
Got it. I think this is the issue. Can you use the hybrid query by following the documentation at https://opensearch.org/docs/latest/search-plugins/hybrid-search/#step-5-search-the-index-using-hybrid-search
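For illustration, a request of the shape shown in the linked documentation (a `_search` call with the `search_pipeline` query parameter and a `hybrid` query clause) can be sketched as below. This is only a sketch: the index name, pipeline name, field names, and the placeholder vector are hypothetical and are not taken from this issue, and the helper `build_hybrid_search` is made up for the example.

```python
from urllib.parse import urlencode


def build_hybrid_search(index, pipeline, sub_queries, size=10):
    """Return (path, body) for a hybrid search against the _search endpoint.

    The search pipeline is passed as a query parameter so that the
    normalization and combination steps are applied to the hybrid query.
    """
    path = f"/{index}/_search?" + urlencode({"search_pipeline": pipeline})
    body = {"size": size, "query": {"hybrid": {"queries": list(sub_queries)}}}
    return path, body


# Hypothetical sub-queries: a keyword match and a k-NN vector search.
keyword_query = {"match": {"text": {"query": "example search terms"}}}
knn_query = {"knn": {"embedding": {"vector": [0.1, 0.2, 0.3], "k": 5}}}

path, body = build_hybrid_search(
    index="my-index",        # hypothetical index name
    pipeline="my-pipeline",  # hypothetical search pipeline name
    sub_queries=[keyword_query, knn_query],
    size=8,
)
```

The returned `path` and `body` can be sent with any HTTP client; the point is that the pipeline name travels with the `_search` request itself.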
I also tried the attached JSON with the query example, and it doesn't work for me. Can you please check one more time that the file was actually uploaded to GitHub?
I just tried using
There are parts of this query (like the query vector embeddings) that I left out intentionally. What exactly do you mean by "it doesn't work"? Also, I double-checked that the file is uploaded to GH.
Try
There is already an issue on OpenSearch core: 15748
@vibrantvarun is it not possible to pass
As of now, no. Once the PR I mentioned above is merged, you can pass it in the request.
@vibrantvarun if we have a pipeline defined, can we do an
Currently, please use
Sure. And @vibrantvarun, the PR description says we can specify the pipeline name, so can we not pass the pipeline value dynamically in the request?
Yes
@rohantilva @vibrantvarun the PR opensearch-project/OpenSearch#15923 is merged and will be released in 2.18.
The core PR mentioned above has been merged. @rohantilva, will you be able to verify the fix using the open-source release of 2.18?
OpenSearch version: 2.15
Environment: AWS OpenSearch
Issue Description
I am executing hybrid queries with three sub-queries on a large dataset containing tens to hundreds of thousands of documents. The queries are weighted as follows: [0.9998, 0.0001, 0.0001], with the first query having the highest weight. However, I am seeing unexpected results where a document with a high score from the first query is missing from the top results in the final ranking, while documents with lower scores from the same query are included.
Example:
However, in the hybrid query, Document B does not appear in the top results, but Document C does, despite the heavily skewed weighting toward the first query (0.9998).
Pipeline Configuration:
Observations:
Essentially, even if Document C returns the highest possible scores from queries 2 and 3, it cannot score higher than Document B. Given this, it seems impossible for Document B to not appear in the final results, and Document C should not rank higher.
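The reasoning above can be checked with a small sketch of textbook min-max normalization followed by a weighted arithmetic mean. The raw scores below are hypothetical (they are not the scores from this issue), and the functions are a plain re-implementation for illustration, not the plugin's actual code, which may treat edge cases such as the minimum score or missing documents differently.

```python
def min_max(scores):
    """Min-max normalize a {doc_id: raw_score} mapping into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as a top hit.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_mean(per_query_scores, weights):
    """Combine normalized scores; a document missing from a sub-query counts as 0."""
    docs = set().union(*per_query_scores)
    total_weight = sum(weights)
    return {
        doc: sum(w * q.get(doc, 0.0) for w, q in zip(weights, per_query_scores))
        / total_weight
        for doc in docs
    }


# Hypothetical raw scores: B is strong in query 1 only; C tops queries 2 and 3.
query_1 = {"A": 10.0, "B": 8.0, "C": 2.0}
query_2 = {"A": 1.0, "C": 5.0}
query_3 = {"A": 1.0, "C": 5.0}
weights = [0.9998, 0.0001, 0.0001]

normalized = [min_max(q) for q in (query_1, query_2, query_3)]
combined = weighted_mean(normalized, weights)
ranking = sorted(combined, key=combined.get, reverse=True)
```

With these numbers, C gets the maximum normalized score of 1.0 from queries 2 and 3, yet those two sub-queries can contribute at most 0.0002 in total, so B stays ahead of C in `ranking`.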
Question:
How is it possible for Document B to be excluded from the top results while Document C is included, given the heavily skewed weights and expected normalization?
Related component
Search:Relevance
Expected behavior
I would expect Document B to appear in the hybrid query search results no matter what, given the weight we've assigned to the first query.