[Bug]: [hybrid_search] The rerank effect needs more improvements when setting different metric type for different vector field using "WeightedRanker" reranker #31368

Closed
1 task done
binbinlv opened this issue Mar 18, 2024 · 3 comments
Labels
kind/bug: issues or changes related to a bug
kind/improvement: changes related to improvements, such as unit tests (UT) and code refactoring
stale: indicates no updates for 30 days
triage/accepted: indicates an issue or PR is ready to be actively worked on

binbinlv commented Mar 18, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-latest
- Deployment mode(standalone or cluster): both
- MQ type(rocksmq, pulsar or kafka):    all
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0rc57
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Reranking results can be poor when different vector fields use different metric types with the "WeightedRanker" reranker.

For example:

When float vector field A uses metric type "COSINE" and float vector field B uses "L2", hybrid search currently sorts the final result according to the metric type of the first vector field in the schema.

That ordering does not reflect similarity after reranking, because a larger "COSINE" score means more similar, while a smaller "L2" distance means more similar.

In addition, a weighted sum of two values with very different ranges is not very meaningful: "COSINE" ranges over "-1 ~ 1", while "L2" ranges over "0 ~ +∞".
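To make the two points concrete, here is a small sketch of what a raw weighted sum does when one field reports "COSINE" and the other reports "L2". The candidate names, scores, weights, and the `raw_weighted_sum` helper are made up for illustration; they are not taken from a Milvus run or from the Milvus source.

```python
# Illustrative numbers only; not taken from a real Milvus run.
# Each candidate has a COSINE score on field A (larger = more similar)
# and an L2 distance on field B (smaller = more similar).
candidates = {
    "near": {"cosine": 0.95, "l2": 0.5},   # close to the query on both fields
    "far":  {"cosine": 0.10, "l2": 40.0},  # distant from the query on both fields
}


def raw_weighted_sum(scores, w_cosine, w_l2):
    """Hypothetical helper: weighted sum of the raw, un-normalized scores."""
    return w_cosine * scores["cosine"] + w_l2 * scores["l2"]


fused = {name: raw_weighted_sum(s, 0.5, 0.5) for name, s in candidates.items()}

# Field A comes first and uses COSINE, so sort descending (larger = better).
ranked_desc = sorted(fused, key=fused.get, reverse=True)

print(fused)
print(ranked_desc)
```

With these numbers the "far" candidate ends up first, because its large L2 distance dominates the sum and is then read as a high similarity. Sorting ascending instead would not fix it, since the COSINE term would then push in the wrong direction.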

Expected Behavior

A WeightedRanker algorithm that reflects actual similarity when the vector fields use different metric types.
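One possible direction, sketched here as an assumption and not as the design Milvus adopted, is to map every raw score onto a common scale where larger always means more similar, and only then apply the weights. The `normalize` and `weighted_score` helpers and the arctan-based mappings are hypothetical choices for illustration.

```python
import math


def normalize(score, metric):
    """Hypothetical helper: map a raw score to a 0..1 scale, larger = more similar."""
    if metric == "COSINE":
        # [-1, 1] -> [0, 1]
        return (1.0 + score) / 2.0
    if metric == "IP":
        # (-inf, +inf) -> (0, 1), increasing in the raw score
        return 0.5 + math.atan(score) / math.pi
    if metric == "L2":
        # [0, +inf) -> (0, 1], decreasing in the raw distance
        return 1.0 - 2.0 * math.atan(score) / math.pi
    raise ValueError(f"unsupported metric type: {metric}")


def weighted_score(scores, weights):
    """Weighted sum over (metric, raw_score) pairs after normalization."""
    return sum(w * normalize(s, m) for (m, s), w in zip(scores, weights))


# Illustrative values only: one candidate close on both fields, one distant on both.
near = [("COSINE", 0.95), ("L2", 0.5)]
far = [("COSINE", 0.10), ("L2", 40.0)]
weights = (0.5, 0.5)

print(weighted_score(near, weights) > weighted_score(far, weights))
```

After normalization the fused score can always be sorted in descending order, regardless of which vector field comes first in the schema.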

Steps To Reproduce

  1. create a collection with 3 vector fields
  2. insert data
  3. create an index on each of the 3 vector fields, one with the "COSINE", one with the "L2", and one with the "IP" metric type
  4. load
  5. hybrid search using "WeightedRanker" reranker
    @pytest.mark.parametrize("primary_field", [ct.default_int64_field_name])
    def test_hybrid_search_different_metric_type_each_field(self, primary_field, dim, auto_id, is_flush,
                                                 enable_dynamic_field, metric_type):
        """
        target: test hybrid search for fields with different metric type
        method: create connection, collection, insert and search
        expected: hybrid search successfully with limit(topK)
        """
        # 1. initialize collection with data
        collection_w, _, _, insert_ids, time_stamp = \
            self.init_collection_general(prefix, True, auto_id=auto_id, dim=dim, is_flush=is_flush, is_index=False,
                                         primary_field=primary_field,
                                         enable_dynamic_field=False, multiple_dim_array=[dim, dim])[0:5]
        # 2. extract vector field name
        vector_name_list = cf.extract_vector_field_name_list(collection_w)
        vector_name_list.append(ct.default_float_vec_field_name)
        log.debug(vector_name_list)
        # one FLAT index per vector field, each with a different metric type
        metric_types = ["L2", "IP", "COSINE"]
        for vector_name, metric in zip(vector_name_list, metric_types):
            flat_index = {"index_type": "FLAT", "params": {}, "metric_type": metric}
            collection_w.create_index(vector_name, flat_index)
        collection_w.load()
        # 3. prepare search params: one ANN request per vector field, using that field's metric type
        req_list = []
        for vector_name, metric in zip(vector_name_list, metric_types):
            search_param = {
                "data": [[random.random() for _ in range(dim)] for _ in range(1)],
                "anns_field": vector_name,
                "param": {"metric_type": metric, "offset": 0},
                "limit": default_limit,
                "expr": "int64 > 0"}
            req_list.append(AnnSearchRequest(**search_param))
        # 4. hybrid search
        hybrid_search = collection_w.hybrid_search(req_list, WeightedRanker(0.1, 0.9, 1), default_limit,
                                                   check_task=CheckTasks.check_search_results,
                                                   check_items={"nq": 1,
                                                                "ids": insert_ids,
                                                                "limit": default_limit})[0]
        log.debug(hybrid_search[0].ids)
        log.debug(hybrid_search[0].distances)

Milvus Log

No response

Anything else?

No response

binbinlv added the kind/bug and needs-triage labels Mar 18, 2024
binbinlv added this to the 2.4.0 milestone Mar 18, 2024
yanliang567 changed the title Mar 18, 2024
  from: [Bug]: [hybrid_search] The rerank effect may be bad when setting different metric type for different vector field using "WeightedRanker" reranker
  to: [Bug]: [hybrid_search] The rerank effect need more improvements when setting different metric type for different vector field using "WeightedRanker" reranker
yanliang567 added the triage/accepted and kind/improvement labels and removed the needs-triage label Mar 18, 2024
@yanliang567

/unassign

binbinlv changed the title Mar 18, 2024
  from: [Bug]: [hybrid_search] The rerank effect need more improvements when setting different metric type for different vector field using "WeightedRanker" reranker
  to: [Bug]: [hybrid_search] The rerank effect needs more improvements when setting different metric type for different vector field using "WeightedRanker" reranker
czs007 commented Mar 20, 2024

working on it

sre-ci-robot pushed a commit that referenced this issue Apr 9, 2024
sre-ci-robot pushed a commit that referenced this issue Apr 9, 2024
sre-ci-robot pushed a commit that referenced this issue Apr 16, 2024
yanliang567 changed the milestone from 2.4.0 to 2.4.1 Apr 18, 2024
yellow-shine pushed a commit to yellow-shine/milvus that referenced this issue Apr 18, 2024
sre-ci-robot pushed a commit that referenced this issue Apr 29, 2024
yanliang567 changed the milestone from 2.4.1 to 2.4.2 May 7, 2024

stale bot commented Jun 11, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale bot added the stale label Jun 11, 2024
stale bot closed this as completed Jun 30, 2024
3 participants