Change initial size of DocIdSetBuilder #502

jmazanec15 · 2022-08-08T23:28:37Z

Description

Currently, we grow the size of the DocIDSetBuilder to the maximum id in the results returned by the query. The grow function, however, is intended to take the number of docs the iterator should have, not the maximum doc id.

Internally, this method has logic to determine whether the iterator should be a sparse set of doc ids, or a dense bit set, where 1 indicates whether a docid is set or not. This can cause problems where the wrong iterator type is created. For instance, assume our results are [200,000,000]. Num docs = 1, but max doc = 200,000,000. This would then cause a bitset of 200,000,000 entries to be created for a single doc. Instead, if we set the length to 1, the iterator would have a single int in an int array.

This change will allow lucene to pick the best iterator for our use case.

Issues Resolved

#500

Check List

Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: John Mazanec <[email protected]>

codecov-commenter · 2022-08-08T23:43:55Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.04%. Comparing base (641614d) to head (c5c23e4).
Report is 372 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##               main     #502   +/-   ##
=========================================
  Coverage     84.04%   84.04%           
  Complexity     1019     1019           
=========================================
  Files           146      146           
  Lines          4188     4188           
  Branches        373      373           
=========================================
  Hits           3520     3520           
  Misses          492      492           
  Partials        176      176

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

martin-gaievski · 2022-08-09T16:04:37Z

Is it possible to add a unit test to check for the iterator type? We can check the exact implementation of the BulkAdder returned by grow() method

jmazanec15 · 2022-08-09T16:31:16Z

Is it possible to add a unit test to check for the iterator type? We can check the exact implementation of the BulkAdder returned by grow() method

What would the purpose of this test be? I think the point of switching is so that we can defer to Lucene to give us the correct iterator.

martin-gaievski · 2022-08-09T16:41:54Z

Is it possible to add a unit test to check for the iterator type? We can check the exact implementation of the BulkAdder returned by grow() method

What would the purpose of this test be? I think the point of switching is so that we can defer to Lucene to give us the correct iterator.

I think the main purpose is to make sure which exact implementation is returned by Lucene and if that matches our expectations. Lucene has some internal logic behind identifying the exact class for adder, so let's say in future that change it - we can catch that earlier in tests and act accordingly.

jmazanec15 · 2022-08-09T20:32:23Z

Is it possible to add a unit test to check for the iterator type? We can check the exact implementation of the BulkAdder returned by grow() method

What would the purpose of this test be? I think the point of switching is so that we can defer to Lucene to give us the correct iterator.

I think the main purpose is to make sure which exact implementation is returned by Lucene and if that matches our expectations. Lucene has some internal logic behind identifying the exact class for adder, so let's say in future that change it - we can catch that earlier in tests and act accordingly.

I think unit tests would need to test DocIdSetBuilder. Given that we dont implement this, I am not sure it makes sense to add a unit test testing this functionality in our plugin. I think if we want more control over which iterator is built, we would need to implement our own DocIdSetIteratorBuilder and test that. However, here I just want to defer to Lucene's DocIdSetBuilder to make decisions on which iterator to build. In general, this functionality will not be unique to k-NN.

Changes initial size of the docidsetbuilder used for iterating over results for k-NN queries. Originally, it was set to the maximum docid. This changes it to be the number of docs returned. Signed-off-by: John Mazanec <[email protected]> (cherry picked from commit 586958e)

Changes initial size of the docidsetbuilder used for iterating over results for k-NN queries. Originally, it was set to the maximum docid. This changes it to be the number of docs returned. Signed-off-by: John Mazanec <[email protected]> (cherry picked from commit 586958e) Co-authored-by: John Mazanec <[email protected]>

Change size of doc id set iterator

c5c23e4

Signed-off-by: John Mazanec <[email protected]>

jmazanec15 requested a review from a team August 8, 2022 23:28

jmazanec15 added Enhancements Increases software capabilities beyond original client specifications backport 2.x labels Aug 8, 2022

jmazanec15 requested review from navneet1v and martin-gaievski August 8, 2022 23:52

navneet1v approved these changes Aug 10, 2022

View reviewed changes

martin-gaievski approved these changes Aug 11, 2022

View reviewed changes

jmazanec15 merged commit 586958e into opensearch-project:main Aug 11, 2022

opensearch-trigger-bot bot mentioned this pull request Aug 11, 2022

[Backport 2.x] Change initial size of DocIdSetBuilder #511

Merged

jmazanec15 mentioned this pull request Aug 18, 2022

[BUG] Sporadic empty inner hits on nested kNN search #466

Closed

navneet1v added the 2.3.0 label Sep 7, 2022

heemin32 added v2.3.0 'Issues and PRs related to version v2.3.0' and removed 2.3.0 labels Nov 2, 2022

martin-gaievski added backport 1.x backport 1.3 Backports PRs to 1.3 branch labels Feb 7, 2023

opensearch-trigger-bot bot mentioned this pull request Feb 7, 2023

[Backport 1.3] Change initial size of DocIdSetBuilder #752

Merged

opensearch-trigger-bot bot mentioned this pull request Feb 7, 2023

[Backport 1.x] Change initial size of DocIdSetBuilder #753

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change initial size of DocIdSetBuilder #502

Change initial size of DocIdSetBuilder #502

jmazanec15 commented Aug 8, 2022

codecov-commenter commented Aug 8, 2022 •

edited by codecov bot

Loading

martin-gaievski commented Aug 9, 2022

jmazanec15 commented Aug 9, 2022

martin-gaievski commented Aug 9, 2022

jmazanec15 commented Aug 9, 2022

Change initial size of DocIdSetBuilder #502

Change initial size of DocIdSetBuilder #502

Conversation

jmazanec15 commented Aug 8, 2022

Description

Issues Resolved

Check List

codecov-commenter commented Aug 8, 2022 • edited by codecov bot Loading

Codecov Report

martin-gaievski commented Aug 9, 2022

jmazanec15 commented Aug 9, 2022

martin-gaievski commented Aug 9, 2022

jmazanec15 commented Aug 9, 2022

codecov-commenter commented Aug 8, 2022 •

edited by codecov bot

Loading