[BUG] Unexpected Ranking Behavior in Hybrid Query with Min-Max Normalization and Arithmetic Mean Combination #910

rohantilva · 2024-09-13T00:49:42Z

OpenSearch Version: 2.15
Environment: AWS OpenSearch

Issue Description

I am executing hybrid queries with three sub-queries on a large dataset containing tens to hundreds of thousands of documents. The queries are weighted as follows: [0.9998, 0.0001, 0.0001], with the first query having the highest weight. However, I am seeing unexpected results where a document with a high score from the first query is missing from the top results in the final ranking, while documents with lower scores from the same query are included.

Example:

Documents: A, B, C, D
Query 1 Scores (when run independently):
- Document A: 1200
- Document B: 1000
- Document C: 300
- Document D: 100

However, in the hybrid query, Document B does not appear in the top results, but Document C does, despite the heavily skewed weighting toward the first query (0.9998).

Pipeline Configuration:

{
  "phase_results_processors": [
    {
      "normalization-processor": {
        "combination": {
          "parameters": {
            "weights": [
              0.9998,
              0.0001,
              0.0001
            ]
          },
          "technique": "arithmetic_mean"
        },
        "normalization": {
          "technique": "min_max"
        }
      }
    }
  ]
}

Observations:

Essentially, even if Document C returns the highest possible scores from queries 2 and 3, it cannot score higher than Document B. Given this, it seems impossible for Document B to not appear in the final results, and Document C should not rank higher.
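
The arithmetic behind this observation can be sketched as follows. This is a minimal sketch, assuming textbook min-max normalization and a weighted arithmetic mean, and assuming the four scores listed above are the complete query 1 result set (so the minimum is 100 and the maximum is 1200). The normalization processor's actual implementation may differ in details, such as how it treats the lowest-scoring document or documents missing from a sub-query. The function names are illustrative and not part of any OpenSearch API.

```python
def min_max(scores):
    """Textbook min-max normalization: map raw scores onto [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # every document has the same score
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_mean(normalized_scores, weights):
    """Weighted arithmetic mean of one document's per-query normalized scores."""
    return sum(w * s for w, s in zip(weights, normalized_scores)) / sum(weights)


WEIGHTS = [0.9998, 0.0001, 0.0001]  # weights from the pipeline configuration
# Assumption: these four documents are the whole query 1 result set.
query1 = min_max({"A": 1200, "B": 1000, "C": 300, "D": 100})

# Worst case for B: it scores 0 in queries 2 and 3.
b_worst = weighted_mean([query1["B"], 0.0, 0.0], WEIGHTS)
# Best case for C: it gets the top normalized score (1.0) in queries 2 and 3.
c_best = weighted_mean([query1["C"], 1.0, 1.0], WEIGHTS)

print(f"B worst case: {b_worst:.4f}")
print(f"C best case:  {c_best:.4f}")
```

Under these assumptions, B's worst case (about 0.818) stays well above C's best case (about 0.182), which is why C outranking B looks inconsistent with the configured weights.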

Question:

How is it possible for Document B to be excluded from the top results while Document C is included, given the heavily skewed weights and expected normalization?

Related component

Search:Relevance

Expected behavior

I would expect Document B to appear in the hybrid query search results no matter what, given the weight we've assigned to the first query.


getsaurabh02 · 2024-09-18T16:30:33Z

Thanks @rohantilva. We will move this to the Neural Search repository.

vibrantvarun · 2024-09-19T06:40:51Z

Hi @rohantilva, thanks for creating this issue. Can you share some insights on the shape of the search request? As per the example you shared in the issue description, B should not be missed. To investigate further, I need to reproduce the issue on my end. It would be great if you could share the steps to reproduce it.

martin-gaievski · 2024-09-19T18:34:36Z

Thanks for the feedback @rohantilva. I'd like to add one more ask to the previous request: can you please also share the exact requests used to obtain the raw scores mentioned in the header:

Documents: A, B, C, D
Query 1 Scores (when run independently):
- Document A: 1200
- Document B: 1000
- Document C: 300
- Document D: 100

rohantilva · 2024-09-19T20:43:14Z

@vibrantvarun @martin-gaievski Thanks for jumping on this. Some details below/attached.

Hybrid query request: this is an example of a similar looking request (I trimmed some of the extraneous fields to remove sensitive information). It's a hybrid query executing 2 queries (keyword match + knn semantic search), where the weight of the first query is set to 1 and the weight of the second query is set to 0 (note: I set these weights intentionally to illustrate the bug in full effect). Also note: I've removed the actual embeddings from the request (hence "vector": []).

hybrid_query.json

Screenshots of results: there are three "sections" in the screenshot, which show the following:

- First section: results from OpenSearch when the keyword query (the first subquery within the hybrid query) is executed individually.
- Second section: results from OpenSearch when the knn/semantic query (the second subquery) is executed individually.
- Third section: results from the hybrid query. You can clearly see that even though the weights of the queries are set to [1, 0], the hybrid query results do not match the results from the keyword query.

Btw, I am using AWS-managed OpenSearch, version 2.15. I know there could be some drift between that and the open-source version 2.15, so I just wanted to point that out.

vibrantvarun · 2024-09-23T20:13:17Z

Hey @rohantilva I will look into this today.

vibrantvarun · 2024-09-24T21:41:11Z

Hey @rohantilva, I just verified that the hybrid query works as expected. I have some concerns regarding the screenshot you shared. The subqueries, when run individually, each yield 5 results, and the hybrid query returns 8 results, which would be consistent with 2 duplicate documents that are part of the result sets of both subqueries. However, I can clearly see that those duplicate results are not even present in the hybrid query result set. I think some other issue is affecting this.

How come documents starting with "Checkissuing ..." are part of the hybrid result when neither of the subqueries returns them?

vibrantvarun · 2024-09-24T21:43:30Z

Also, I just wanted to confirm that you ran each subquery on its own, as a standalone query, and not under the hybrid clause.

rohantilva · 2024-09-24T21:58:30Z

@vibrantvarun

- Yes, I executed the queries individually.
- This bug is about why the results from the hybrid query do not match the results from the subqueries, so your question ("How come documents starting with 'Checkissuing...' are part of hybrid result when either of the subqueries do not return it?") is exactly my question. I'm not sure why, when the weights are set to [1, 0], the hybrid query results do not match the results from the first query when it is executed individually.

The reason why the individual subqueries yielded 5 results was because I explicitly requested 5 documents for the subqueries (again, executed individually), and >5 documents for the hybrid query. If I request only 5 documents for the hybrid query, I get the same results, excluding the last 3 pictured in the screenshot I sent.
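
The expectation in this comment, that weights of [1, 0] should make the hybrid ranking follow the keyword query's ranking, can be checked with a small sketch. It assumes textbook min-max normalization, a weighted arithmetic mean, and a score of 0 for any document that a sub-query does not return. The document IDs and scores are made up for illustration and are not taken from the screenshots, and the function names are hypothetical.

```python
def min_max(scores):
    """Textbook min-max normalization: map raw scores onto [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # every document has the same score
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(sub_query_scores, weights):
    """Normalize each sub-query, combine by weighted mean, return doc IDs by rank."""
    normalized = [min_max(scores) for scores in sub_query_scores]
    doc_ids = set().union(*normalized)  # every document seen by any sub-query
    combined = {
        # Assumption: a document missing from a sub-query contributes 0.
        doc: sum(w * n.get(doc, 0.0) for w, n in zip(weights, normalized)) / sum(weights)
        for doc in doc_ids
    }
    # Sort by descending score; break ties by document ID for a stable order.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))


keyword = {"doc1": 12.0, "doc2": 9.0, "doc3": 4.0, "doc4": 1.0}  # hypothetical scores
knn = {"doc3": 0.9, "doc5": 0.8, "doc6": 0.7}  # hypothetical scores

ranking = hybrid_rank([keyword, knn], weights=[1.0, 0.0])
keyword_order = sorted(keyword, key=keyword.get, reverse=True)
print(ranking)
print(keyword_order)
```

Under these assumptions, the documents with a non-zero normalized keyword score keep their keyword order at the top of the hybrid ranking. The lowest-scoring keyword document normalizes to 0 and therefore ties with documents returned only by the knn sub-query.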

vibrantvarun · 2024-09-25T18:41:23Z

What is the URL you are hitting for the hybrid query? Can you share it?

rohantilva · 2024-09-25T18:51:17Z

> What is the URL you are hitting for the hybrid query? Can you share it?

@vibrantvarun we're using this API: https://opensearch.org/docs/latest/api-reference/multi-search/

vibrantvarun · 2024-09-25T18:57:35Z

Got it. I think this is the issue. Can you run the hybrid query as described in the documentation: https://opensearch.org/docs/latest/search-plugins/hybrid-search/#step-5-search-the-index-using-hybrid-search

martin-gaievski · 2024-09-25T19:00:41Z

I also tried the attached JSON with the query example, and it doesn't work for me. Can you please check one more time that the file actually got uploaded to GitHub?

rohantilva · 2024-09-25T19:20:01Z

> Got it. I think this is the issue. Can you run the hybrid query as described in the documentation: https://opensearch.org/docs/latest/search-plugins/hybrid-search/#step-5-search-the-index-using-hybrid-search

I just tried using /_search and I'm getting the same issue: the order of results from the keyword query and the hybrid query does not match, even though the weight for the keyword query is set to 1.

rohantilva · 2024-09-25T19:21:57Z

> I also tried the attached JSON with the query example, and it doesn't work for me. Can you please check one more time that the file actually got uploaded to GitHub?

There are parts of this query (like the query vector embeddings) that I left out intentionally. What exactly do you mean by "it doesn't work"? Also, I double-checked that the file is uploaded to GitHub.

vibrantvarun · 2024-09-25T19:26:49Z

Try /_search?search_pipeline=nlp-search-pipeline. The PR to add support for search pipelines with _msearch is under review and will be released soon.
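
A request following this suggestion might be built as shown below. This is a sketch only: the host, index name, field names, and vector values are hypothetical placeholders, and the helper `build_hybrid_search` is not part of any client library. It only constructs the URL and JSON body and does not send anything.

```python
import json
from urllib.parse import urlencode


def build_hybrid_search(host, index, pipeline, sub_queries, size=10):
    """Return (url, body) for a hybrid query that names the pipeline in the URL."""
    params = urlencode({"search_pipeline": pipeline})
    url = f"{host}/{index}/_search?{params}"
    body = {"size": size, "query": {"hybrid": {"queries": sub_queries}}}
    return url, json.dumps(body)


url, body = build_hybrid_search(
    "https://localhost:9200",  # hypothetical host
    "my-index",                # hypothetical index name
    "nlp-search-pipeline",     # pipeline name used in the comment above
    [
        # Hypothetical keyword sub-query on a hypothetical "title" field.
        {"match": {"title": "example keywords"}},
        # Hypothetical k-NN sub-query on a hypothetical "embedding" field.
        {"knn": {"embedding": {"vector": [0.1, 0.2, 0.3], "k": 5}}},
    ],
)
print(url)
print(body)
```

The pipeline is named in the query string rather than in the request body, which matches the guidance given in this thread for the `_search` endpoint.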

vibrantvarun · 2024-09-25T19:27:56Z

There is already an issue for this in OpenSearch core: 15748

rohantilva · 2024-09-25T19:29:24Z

> Try /_search?search_pipeline=nlp-search-pipeline. The PR to add support for search pipelines with _msearch is under review and will be released soon.

@vibrantvarun is it not possible to pass search_pipeline in the query body (see the JSON file I attached originally) rather than as a query param to the API itself?

vibrantvarun · 2024-09-25T19:30:24Z

As of now, no. Once the PR I mentioned above is merged, you can pass it in the request.

harshatba · 2024-09-25T19:33:56Z

@vibrantvarun if we have a pipeline defined, can we use msearch, or should we just use the _search endpoint?

vibrantvarun · 2024-09-25T19:36:53Z

Currently, please use the _search?search_pipeline=<normalization pipeline name> endpoint. In the future, when this PR is merged into OpenSearch core, you can use it with msearch as well.

harshatba · 2024-09-25T19:40:07Z

Sure. And @vibrantvarun, the PR description says we can specify the pipeline name, so can we not pass the pipeline value dynamically in the request?

vibrantvarun · 2024-09-25T19:41:16Z

Yes

owaiskazi19 · 2024-09-27T03:00:56Z

@rohantilva @vibrantvarun The PR opensearch-project/OpenSearch#15923 is merged and will be released in 2.18.

martin-gaievski · 2024-10-15T04:24:46Z

The core PR mentioned above has been merged. @rohantilva, will you be able to verify the fix using the open-source release of 2.18?

rohantilva added the bug and untriaged labels Sep 13, 2024

getsaurabh02 removed the untriaged label Sep 18, 2024

getsaurabh02 transferred this issue from opensearch-project/OpenSearch Sep 18, 2024

github-actions bot added the untriaged label Sep 18, 2024

martin-gaievski removed the untriaged label Sep 19, 2024

vibrantvarun self-assigned this Sep 25, 2024

jmazanec15 added the v2.18.0 label Oct 2, 2024

vibrantvarun assigned martin-gaievski and unassigned vibrantvarun and martin-gaievski Oct 10, 2024
