-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-10674: Ensure BitSetConjDISI returns NO_MORE_DOCS when sub-iterator exhausts #1068
Conversation
Update all subiterators of BitSetConjunctionDISI to NO_MORE_DOCS when lead iterator's current doc is greater than the length of the smallest bitset in the bitset iterators. Signed-off-by: John Mazanec <[email protected]>
Thanks, looks reasonable to me, please add an entry to CHANGES.txt! |
Signed-off-by: John Mazanec <[email protected]>
@zhaih Thanks! Added entry for 9.5 release. |
The fact that it should be legal for |
} | ||
for (BitSetIterator iterator : bitSetIterators) { | ||
iterator.setDocId(NO_MORE_DOCS); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if
statement makes sense to me, but I'm curious how you managed to hit this case. This suggests that we create BitSets whose size is not maxDoc
, do you know where this happens?
The for
loop should be unnecessary, there is no guarantee that all sub iterators advance to NO_MORE_DOCS
. If this causes problems, then it means we have another bug somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if statement makes sense to me, but I'm curious how you managed to hit this case. This suggests that we create BitSets whose size is not maxDoc, do you know where this happens?
I think I might be misunderstanding the question. Each bitsetiterator could have a different length of bitset, potentially as an optimization (minLength I think suggests this is expected). If a bitsetiterator's top match is 10 and there are 1M docs in the index, I think there was no reason to store 1M bits - the bitsetiterator can just exhaust after 10.
The for loop should be unnecessary, there is no guarantee that all sub iterators advance to NO_MORE_DOCS. If this causes problems, then it means we have another bug somewhere else?
Agree this is probably unnecessary. I added it to ensure that this statement holds: "Requires that all of its sub-iterators must be on the same document all the time."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see. For this, we created a DocIdSetIterator using DocIdSetBuilder. For maxDoc, it was less than the number of docs in the index. This led to the length of the bitsets being different.
Is it valid for an iterator to advance outside of the ConjunctionDISI? I cant find anywhere that prevents this, but I was under the assumption it should be invalid to do this.
In general I was trying to keep all sub-iterators on the same doc all the time. In terms of importance, I guess it depends if the iterators can be used outside the ConjunctionDISI. |
Thank you for your comments, I think I understand the bug now. I think a better description of the bug is that
It is invalid indeed. The goal of the assertion that all iterators are on the same doc is to catch such issues.
I have a preference for not advancing other iterators, because it should not be necessary, and if it is then this means we have another bug somewhere else. The I would suggest to no longer advance other iterators, and change the test to make sure that |
Removes updating of sub-iterators on BitSetConjDISI exhaust. Because it is invalid to use a sub-iterator outside of the the ConjDISI, it does not make sense to update these iterators when the lead has exhausted. Updates test and CHANGES description to reflect this change. Signed-off-by: John Mazanec <[email protected]>
b40554b
to
220089e
Compare
@jpountz That makes sense. If the iterators are not valid outside of the ConjDISI, it doesnt make sense to advance them once the leader is exhausted. I updated test and changes to reflect this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change and the test look great. Thank you!
…rator exhausts. (#1068) Signed-off-by: John Mazanec <[email protected]>
…rator exhausts. (#1068) Signed-off-by: John Mazanec <[email protected]>
@jmazanec15 FYI I backported this change to 9.4, so it should be in the next release. |
@jmazanec15 Maybe one thing I'd like to double check: what is the query that created a DocIdSetBuilder with a length that is less than |
* main: (324 commits) Mute TestKnnVectorQuery#testFilterWithSameScore while we work on a fix Guard FieldExistsQuery against null pointers (apache#11794) Fix handling of ghost fields in string sorts. (apache#11792) LUCENE-10365 Wizard changes contributed from Solr (apache#591) GitHub Workflows security hardening (apache#11789) Improve tessellator performance by delaying calls to the method #isIntersectingPolygon (apache#11786) update DOAP and releaseWizard to reflect migration to github (apache#11747) Diversity check bugfix (apache#11781) Fix rare bug in TestKnnVectorQuery when we have multiple segments GITHUB#11778: Add detailed part-of-speech tag for particle and ending on Nori (apache#11779) LUCENE-10674: Move changes entry to 9.4. apacheGH-11172: remove WindowsDirectory and native subproject. (apache#11774) LUCENE-10674: Ensure BitSetConjDISI returns NO_MORE_DOCS when sub-iterator exhausts. (apache#1068) Removed duplicate check in SpanGradientFormatter (apache#11762) Fix integer overflow in tests. GITHUB#11742: MatchingFacetSetsCounts#getTopChildren now returns top children instead of all children (apache#11764) Retry gradle wrapper download on http 500 and 503. (apache#11766) Fix a typo affecting Luke (apache#11763) Fix IntervalBuilder.NO_INTERVALS docId when unpositioned (apache#11760) LUCENE-10592 Better estimate memory for HNSW graph (apache#11743) ...
Description (or a Jira issue link if you have one)
Update all subiterators of BitSetConjunctionDISI to NO_MORE_DOCS when lead iterator's current doc is greater than the length of the smallest bitset in the bitset iterators. Add test to verify sub iterators exhaust when this case happens.
LUCENE-10674