
[BUG] Inaccuracy in IVF-Flat Search Results with Large Number of Queries #1756

Closed
tarang-jain opened this issue Aug 19, 2023 · 3 comments · Fixed by #1764
Labels: bug (Something isn't working), FAISS, Vector Search

Comments

@tarang-jain (Contributor)

Describe the bug
IVF-Flat Search gives inconsistent results with different batch sizes when the number of queries is very large. This is a blocker in the FAISS IVF-Flat integration. cc @achirkin @cjnolet @tfeher

Steps/Code to reproduce bug
Context: With the FAISS integration work underway, the following FAISS test fails: LargeBatch
This test runs 100,000 search queries on an IVF-Flat index and compares the resulting indices and distances with a FAISS CPU IVF-Flat Index.

Expected behavior
I tried modifying the batch size by changing this line to
const uint32_t max_queries = std::min<uint32_t>(n_queries, 10000);
and the test then passes. I also tried other values less than 32768 and came to the conclusion that whenever the max_queries defined here is greater than kMaxGridY the results are incorrect, and whenever it is less than kMaxGridY the FAISS test passes.
In other words, whenever the kernel here runs more than once, the test fails.
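For readers less familiar with the kernel layout, here is a minimal illustrative sketch of why max_queries > kMaxGridY means the scan kernel launches more than once. The names are hypothetical and kMaxGridY = 32768 is assumed from the numbers in this report, not read from the raft source:

```cpp
#include <cstdint>

// Illustrative sketch only: the interleaved-scan launcher covers queries along
// the CUDA grid's Y dimension, so a query batch larger than the grid-Y cap has
// to be split into several kernel launches.
constexpr uint32_t kMaxGridY = 32768;  // assumed cap, inferred from this report

uint32_t num_scan_launches(uint32_t max_queries)
{
  // Ceiling division: one launch per kMaxGridY-sized chunk of the query batch.
  return (max_queries + kMaxGridY - 1) / kMaxGridY;
}
// e.g. num_scan_launches(10000) == 1 (test passes),
//      num_scan_launches(65536) == 2 (test fails).
```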

@tarang-jain added the bug (Something isn't working), FAISS, and Vector Search labels on Aug 19, 2023
@achirkin (Contributor)

Thanks for the report. Apparently, we do test this use case in raft. I've tried adding a dozen other parameter combinations but consistently get a recall of 1.0; that is, I haven't been able to reproduce this in raft.
Could you please report the indexing and search parameters you use in FAISS, so that we can try to reproduce it here?
Could you also elaborate on how inconsistent the results are? Is the recall just a bit smaller than 1.0, or are the results completely wrong (e.g. recall < 0.5)?

So far, the only hypothesis that comes to mind is a concurrency issue: maybe when raft runs more than one batch iteration, the submitted GPU work piles up and does not finish before the results are submitted for evaluation. Do you set the CUDA stream in raft::resources (raft_handle) to be the same as the stream FAISS uses under the hood, or do you synchronize between them / the device?
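In case it helps rule that hypothesis out, a rough sketch of the two ways the streams could be reconciled. The raft calls are written from memory and may differ between raft versions, so treat the exact names as assumptions:

```cpp
#include <cuda_runtime.h>
#include <raft/core/device_resources.hpp>
#include <rmm/cuda_stream_view.hpp>

// Rough sketch (API names from memory): keep raft and FAISS work ordered on
// the GPU before comparing their results on the host.
void reconcile_streams(cudaStream_t faiss_stream)
{
  // Option 1: build the raft handle on the stream FAISS already uses, so all
  // raft kernels are enqueued on the same stream as the FAISS work.
  raft::device_resources handle{rmm::cuda_stream_view{faiss_stream}};

  // ... run the raft ivf_flat search through `handle` here ...

  // Option 2: if the handle uses its own stream, synchronize both streams
  // before copying indices/distances back for the CPU-side comparison.
  handle.sync_stream();                 // wait for work on the handle's stream
  cudaStreamSynchronize(faiss_stream);  // wait for work on the FAISS stream
}
```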

@tarang-jain (Contributor, Author)

Thanks for pointing out the large query batch test.
Steps to reproduce the inconsistent results:

  1. Add the following input to the gtests (see the annotated reading after this list):
    {100000, 8712, 3, 10, 51, 66, raft::distance::DistanceType::L2Expanded, false}
  2. Set the kmeans_trainset_fraction to 1.0 here
  3. Run the test for float inputs. The recall should be ~0.83
  4. Change this line to const uint32_t max_queries = std::min<uint32_t>(n_queries, 32000);
  5. Run the same test again and notice that the recall is now much higher (=1.000).

This set of inputs is the same as the one being run in FAISS.
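For anyone reading along, here is a hedged field-by-field reading of the input tuple in step 1. The struct below is a hypothetical mirror written only to make the tuple readable; the field order reflects my understanding of the ann_ivf_flat gtest inputs struct and should be verified against the test source:

```cpp
#include <cstdint>
#include <raft/distance/distance_types.hpp>

// Hypothetical mirror of the gtest inputs struct, for annotation only;
// the real struct (and its index type) lives in the raft test code.
struct IvfFlatTestInputs {
  uint32_t num_queries;
  uint32_t num_db_vecs;
  uint32_t dim;
  uint32_t k;
  uint32_t nprobe;
  uint32_t nlist;
  raft::distance::DistanceType metric;
  bool adaptive_centers;
};

// My reading of the reproducer: 100k queries over 8712 three-dimensional
// vectors, k=10, 51 probes out of 66 lists, L2Expanded, non-adaptive centers.
constexpr IvfFlatTestInputs kReproducer{
  100000, 8712, 3, 10, 51, 66, raft::distance::DistanceType::L2Expanded, false};
```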

@achirkin (Contributor)

achirkin commented Aug 23, 2023

Thanks for the very helpful reproducer! The PR is ready.

rapids-bot pushed a commit that referenced this issue on Aug 23, 2023
Fix the cluster probes (coarse_index) not being advanced when batching.

Thanks @tarang-jain for the precise reproducer.

Closes: #1756

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #1764
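For readers skimming the thread, a hypothetical sketch (not the actual diff in #1764) of the class of bug the commit message describes: when the query set is processed in chunks, the cluster-probe array (coarse_index) has to be advanced together with the query pointer, otherwise every chunk after the first scans the probes selected for the first chunk.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of per-batch pointer advancement; names and signature
// are illustrative, not the real raft code or the #1764 diff.
void search_in_batches(const float* queries,          // n_queries * dim
                       const uint32_t* coarse_index,  // n_queries * n_probes probe ids
                       uint32_t n_queries,
                       uint32_t dim,
                       uint32_t n_probes,
                       uint32_t max_batch)
{
  for (uint32_t offset = 0; offset < n_queries; offset += max_batch) {
    const uint32_t batch = std::min(max_batch, n_queries - offset);

    // Both per-query inputs must move with the batch offset. The reported bug
    // matches the case where the probe pointer stays at the start, so batches
    // after the first scan clusters chosen for the first batch's queries.
    const float* queries_batch =
      queries + static_cast<std::size_t>(offset) * dim;
    const uint32_t* probes_batch =
      coarse_index + static_cast<std::size_t>(offset) * n_probes;

    // launch_interleaved_scan(queries_batch, probes_batch, batch, ...);  // illustrative
    (void)queries_batch;
    (void)probes_batch;
    (void)batch;
  }
}
```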