[Bug]: [sparse] searching iterator on sparse vectors returns duplicate results #36174

Closed
1 task done
yanliang567 opened this issue Sep 11, 2024 · 6 comments
Assignees
Labels
2.5-features kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@yanliang567
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20240911-42eef490-amd64
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus-2.5.0rc79

Current Behavior

Running a search iterator on a sparse vector field returns duplicate results.

Expected Behavior

The search iterator returns no duplicate results, as is already the case when searching dense vectors.

Steps To Reproduce

1. Create a collection with a sparse vector field.
2. Insert 4000 rows.
3. Build an index and load the collection.
4. Run a search iterator with batch_size=10 (or 100).
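The duplicate check implied by these steps can be sketched in plain Python. This is only an illustration: `find_duplicate_ids` is a hypothetical helper, not part of pymilvus or the Milvus test suite, and each iterator batch is modeled as a plain list of primary keys.

```python
from collections import Counter


def find_duplicate_ids(batches):
    """Return the ids that appear more than once across all iterator batches."""
    counts = Counter(pk for batch in batches for pk in batch)
    return sorted(pk for pk, n in counts.items() if n > 1)


# Hypothetical batches: the second case repeats id 3 in consecutive batches.
batches_ok = [[1, 2, 3], [4, 5, 6]]
batches_bad = [[1, 2, 3], [3, 4, 5]]

print(find_duplicate_ids(batches_ok))
print(find_duplicate_ids(batches_bad))
```

An empty list means every id was returned at most once, which is the expected behavior for both dense and sparse vectors.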

Milvus Log

No response

Anything else?

Run the test below:

    @pytest.mark.tags(CaseLabel.L2)
    @pytest.mark.parametrize("index", ct.all_index_types[9:11])
    def test_sparse_vector_search_iterator(self, index):
        """
        target: create sparse vectors and search iterator
        method: create sparse vectors and search iterator
        expected: normal search
        """
        self._connect()
        c_name = cf.gen_unique_str(prefix)
        schema = cf.gen_default_sparse_schema()
        collection_w = self.init_collection_wrap(c_name, schema=schema)
        data = cf.gen_default_list_sparse_data(nb=4000)
        collection_w.insert(data)
        params = cf.get_index_params_params(index)
        index_params = {"index_type": index, "metric_type": "IP", "params": params}
        collection_w.create_index(ct.default_sparse_vec_field_name, index_params, index_name=index)

        collection_w.load()
        batch_size = 100
        collection_w.search_iterator(data[-1][0:1], ct.default_sparse_vec_field_name,
                                     ct.default_sparse_search_params, limit=500, batch_size=batch_size,
                                     check_task=CheckTasks.check_search_iterator,
                                     check_items={"batch_size": batch_size})
@yanliang567 yanliang567 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 11, 2024
@yanliang567 yanliang567 self-assigned this Sep 11, 2024
@yanliang567
Contributor Author

/assign @zhengbuqian
/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 11, 2024
@yanliang567 yanliang567 added this to the 2.5.0 milestone Sep 11, 2024
@PwzXxm
Contributor

PwzXxm commented Nov 12, 2024

drop_ratio does not affect the refine process in the KNN call, but it does affect compute_all_distances(), which RangeSearch calls. The distances computed by the latter are therefore smaller than those computed by the former, which causes the ids to overlap.
For example, KNN returns (id, dist): (1, 0.5), (2, 0.4), (3, 0.3), and the search iterator uses 0.3 as the bound for the following RangeSearch call. Because drop_ratio applies to RangeSearch but not to KNN, (3, 0.28) may appear in the first RangeSearch call.
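The mismatch can be reproduced with a toy model. The sketch below is not knowhere's implementation: the functions, the strict `<` bound, and the way `drop_ratio` is modeled (dropping the smallest-magnitude query terms) are all assumptions made for illustration, and the data is chosen to mirror the numbers in the example above.

```python
def inner_product(query, doc):
    """Sparse inner product; vectors are dicts of dimension -> weight."""
    return sum(w * doc.get(dim, 0.0) for dim, w in query.items())


def drop_small_terms(query, drop_ratio):
    """Toy model of drop_ratio: discard the smallest-magnitude query terms."""
    n_drop = int(len(query) * drop_ratio)
    kept = sorted(query.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(kept[:len(kept) - n_drop])


def knn(query, docs, k):
    """Top-k by full (undropped) distance, as in the refined KNN call."""
    scored = [(i, inner_product(query, d)) for i, d in docs.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]


def range_search(query, docs, bound, drop_ratio):
    """Ids whose distance is strictly below the bound, with drop_ratio applied."""
    pruned = drop_small_terms(query, drop_ratio)
    scored = [(i, inner_product(pruned, d)) for i, d in docs.items()]
    return sorted(i for i, dist in scored if dist < bound)


query = {0: 1.0, 1: 0.1}
docs = {
    1: {0: 0.5},
    2: {0: 0.4},
    3: {0: 0.28, 1: 0.2},  # about 0.30 in full, 0.28 once dim 1 is dropped
    4: {0: 0.2},
}

first_batch = knn(query, docs, k=3)
bound = first_batch[-1][1]  # distance of the last KNN hit (id 3)
first_ids = [i for i, _ in first_batch]

consistent = range_search(query, docs, bound, drop_ratio=0.0)
mismatched = range_search(query, docs, bound, drop_ratio=0.5)
duplicates = sorted(set(first_ids) & set(mismatched))

print(first_ids, consistent, mismatched, duplicates)
```

When both calls use the same distances, id 3 sits exactly on the bound and is excluded from the next batch. Once the range search sees the smaller, dropped distance, id 3 falls below the bound and is returned a second time.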

@zhengbuqian
Collaborator

zilliztech/knowhere#944 is intended to make the iterator return the raw distance instead of the quantized distance.

@zhengbuqian
Collaborator

/unassign

/assign @yanliang567

Please verify after updating Milvus to use the latest knowhere.

@yanliang567
Contributor Author

Which build should I use to verify? @zhengbuqian

@yanliang567
Contributor Author

Verified on master-20241118-257ecab8-amd64.


3 participants