Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change initial size of DocIdSetBuilder #502

Merged
merged 1 commit into from
Aug 11, 2022

Conversation

jmazanec15
Copy link
Member

Description

Currently, we grow the size of the DocIDSetBuilder to the maximum id in the results returned by the query. The grow function, however, is intended to take the number of docs the iterator should have, not the maximum doc id.

Internally, this method has logic to determine whether the iterator should be a sparse set of doc ids, or a dense bit set, where 1 indicates whether a docid is set or not. This can cause problems where the wrong iterator type is created. For instance, assume our results are [200,000,000]. Num docs = 1, but max doc = 200,000,000. This would then cause a bitset of 200,000,000 entries to be created for a single doc. Instead, if we set the length to 1, the iterator would have a single int in an int array.

This change will allow lucene to pick the best iterator for our use case.

Issues Resolved

#500

Check List

  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jmazanec15 jmazanec15 requested a review from a team August 8, 2022 23:28
@jmazanec15 jmazanec15 added Enhancements Increases software capabilities beyond original client specifications backport 2.x labels Aug 8, 2022
@codecov-commenter
Copy link

codecov-commenter commented Aug 8, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.04%. Comparing base (641614d) to head (c5c23e4).
Report is 372 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##               main     #502   +/-   ##
=========================================
  Coverage     84.04%   84.04%           
  Complexity     1019     1019           
=========================================
  Files           146      146           
  Lines          4188     4188           
  Branches        373      373           
=========================================
  Hits           3520     3520           
  Misses          492      492           
  Partials        176      176           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@martin-gaievski
Copy link
Member

Is it possible to add a unit test to check for the iterator type? We can check the exact implementation of the BulkAdder returned by grow() method

@jmazanec15
Copy link
Member Author

Is it possible to add a unit test to check for the iterator type? We can check the exact implementation of the BulkAdder returned by grow() method

What would the purpose of this test be? I think the point of switching is so that we can defer to Lucene to give us the correct iterator.

@martin-gaievski
Copy link
Member

Is it possible to add a unit test to check for the iterator type? We can check the exact implementation of the BulkAdder returned by grow() method

What would the purpose of this test be? I think the point of switching is so that we can defer to Lucene to give us the correct iterator.

I think the main purpose is to make sure which exact implementation is returned by Lucene and if that matches our expectations. Lucene has some internal logic behind identifying the exact class for adder, so let's say in future that change it - we can catch that earlier in tests and act accordingly.

@jmazanec15
Copy link
Member Author

Is it possible to add a unit test to check for the iterator type? We can check the exact implementation of the BulkAdder returned by grow() method

What would the purpose of this test be? I think the point of switching is so that we can defer to Lucene to give us the correct iterator.

I think the main purpose is to make sure which exact implementation is returned by Lucene and if that matches our expectations. Lucene has some internal logic behind identifying the exact class for adder, so let's say in future that change it - we can catch that earlier in tests and act accordingly.

I think unit tests would need to test DocIdSetBuilder. Given that we dont implement this, I am not sure it makes sense to add a unit test testing this functionality in our plugin. I think if we want more control over which iterator is built, we would need to implement our own DocIdSetIteratorBuilder and test that. However, here I just want to defer to Lucene's DocIdSetBuilder to make decisions on which iterator to build. In general, this functionality will not be unique to k-NN.

@jmazanec15 jmazanec15 merged commit 586958e into opensearch-project:main Aug 11, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 11, 2022
Changes initial size of the docidsetbuilder used for iterating over results for k-NN queries. Originally, it was set to the maximum docid. This changes it to be the number of docs returned.

Signed-off-by: John Mazanec <[email protected]>
(cherry picked from commit 586958e)
@navneet1v navneet1v added the 2.3.0 label Sep 7, 2022
@heemin32 heemin32 added v2.3.0 'Issues and PRs related to version v2.3.0' and removed 2.3.0 labels Nov 2, 2022
@martin-gaievski martin-gaievski added backport 1.x backport 1.3 Backports PRs to 1.3 branch labels Feb 7, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Feb 7, 2023
Changes initial size of the docidsetbuilder used for iterating over results for k-NN queries. Originally, it was set to the maximum docid. This changes it to be the number of docs returned.

Signed-off-by: John Mazanec <[email protected]>
(cherry picked from commit 586958e)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Feb 7, 2023
Changes initial size of the docidsetbuilder used for iterating over results for k-NN queries. Originally, it was set to the maximum docid. This changes it to be the number of docs returned.

Signed-off-by: John Mazanec <[email protected]>
(cherry picked from commit 586958e)
martin-gaievski pushed a commit that referenced this pull request Feb 7, 2023
Changes initial size of the docidsetbuilder used for iterating over results for k-NN queries. Originally, it was set to the maximum docid. This changes it to be the number of docs returned.

Signed-off-by: John Mazanec <[email protected]>
(cherry picked from commit 586958e)

Co-authored-by: John Mazanec <[email protected]>
martin-gaievski pushed a commit that referenced this pull request Feb 7, 2023
Changes initial size of the docidsetbuilder used for iterating over results for k-NN queries. Originally, it was set to the maximum docid. This changes it to be the number of docs returned.

Signed-off-by: John Mazanec <[email protected]>
(cherry picked from commit 586958e)

Co-authored-by: John Mazanec <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 1.x backport 1.3 Backports PRs to 1.3 branch backport 2.x Enhancements Increases software capabilities beyond original client specifications v2.3.0 'Issues and PRs related to version v2.3.0'
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants