Grow DocIdSetBuilder to number of results instead of max doc id returned #500
**Experimental setup**

I ran performance tests on the SIFT data set, which has 1M 128D vectors and a 10K query set. I tested faiss HNSW with m=16, ef_search=64, and ef_construction=64. I set up a 2.1 cluster with 3 c5.xlarge leader nodes and 1 r5.2xlarge data node. The control cluster came from the official 2.1 release; the test cluster was based off of https://github.com/jmazanec15/k-NN-1/tree/change-docsize-iterator. I ran 3 iterations of each, using 1 primary shard and 0 replicas. Benchmark code can be found at https://github.com/opensearch-project/k-NN/tree/main/benchmarks/osb.

**Results**

(Result tables for Control experiments 1–3 and Test experiments 1–3 omitted.)

**Conclusion**

Performance is similar on the given data set.
It seems the data for experiments 2 and 3 do not show much improvement, and even show worse performance on some metrics. Are these results accurate?
@martin-gaievski For this experiment, I'm not sure we will see much improvement on a data set this small. Overall, the tests above show that the change does not cause any major regressions. We will need a larger data set to see the more subtle improvements. I thought about running a 1B test, but I don't think the size of this change warrants it; I would like to group a few changes together before running at 1B scale.
When creating results, we create a DocIdSetBuilder and grow it to the max doc id returned by the query.
For instance, if we had 3 results, [2, 2000, 20000000], the DocIdSetBuilder would be grown to 20,000,000 docs. This would then cause Lucene to use a bitset instead of a sorted array of integers for the results.
In the case above, the iterator would take 20,000,000 bits (2.5 MB) instead of 3 * 32 bits/int = 96 bits.
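A minimal sketch of the current pattern against Lucene's DocIdSetBuilder API (the variable names, values, and standalone wrapper here are illustrative, not the plugin's actual code):

```java
import org.apache.lucene.util.DocIdSetBuilder;

public class GrowToMaxDoc {
    public static void main(String[] args) {
        int[] resultDocIds = {2, 2000, 20000000}; // hypothetical k-NN hits
        int maxDoc = 20000001;                    // upper bound on doc ids in the segment

        DocIdSetBuilder builder = new DocIdSetBuilder(maxDoc);

        // Current pattern: reserving room for maxDoc documents pushes the
        // builder past its density threshold (roughly maxDoc / 128), so it
        // upgrades to a dense FixedBitSet (~20M bits, ~2.5 MB) even though
        // only 3 docs are ever added.
        DocIdSetBuilder.BulkAdder adder = builder.grow(maxDoc);
        for (int docId : resultDocIds) {
            adder.add(docId);
        }
    }
}
```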
Instead of calling DocIdSetBuilder.grow(maxDoc), we should call DocIdSetBuilder.grow(results.length). Then, Lucene can internally handle the optimization of switching from sparse to dense.
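With the proposed change, the same sketch reserves only as many slots as there are hits (again a sketch with illustrative names, not the plugin's actual call site):

```java
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.util.DocIdSetBuilder;

public class GrowToResultCount {
    public static void main(String[] args) {
        int[] resultDocIds = {2, 2000, 20000000}; // hypothetical k-NN hits
        DocIdSetBuilder builder = new DocIdSetBuilder(20000001);

        // Proposed pattern: reserve space for the actual number of hits.
        // 3 stays well under the density threshold, so Lucene keeps the
        // sparse sorted-int[] buffer and only upgrades to a bitset if
        // enough docs are eventually added.
        DocIdSetBuilder.BulkAdder adder = builder.grow(resultDocIds.length);
        for (int docId : resultDocIds) {
            adder.add(docId);
        }
        DocIdSet docIdSet = builder.build(); // sparse set: 3 ints, 96 bits
    }
}
```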
We should benchmark this change as well to see if there are any improvements/degradations.