Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10674: Ensure BitSetConjDISI returns NO_MORE_DOCS when sub-iterator exhausts #1068

Merged
merged 5 commits into from
Sep 15, 2022

Conversation

jmazanec15
Copy link
Contributor

Description (or a Jira issue link if you have one)

Update all subiterators of BitSetConjunctionDISI to NO_MORE_DOCS when lead iterator's current doc is greater than the length of the smallest bitset in the bitset iterators. Add test to verify sub iterators exhaust when this case happens.

LUCENE-10674

Update all subiterators of BitSetConjunctionDISI to NO_MORE_DOCS when
lead iterator's current doc is greater than the length of the smallest
bitset in the bitset iterators.

Signed-off-by: John Mazanec <[email protected]>
@zhaih
Copy link
Contributor

zhaih commented Sep 6, 2022

Thanks, looks reasonable to me, please add an entry to CHANGES.txt!

@jmazanec15
Copy link
Contributor Author

@zhaih Thanks! Added entry for 9.5 release.

@jpountz
Copy link
Contributor

jpountz commented Sep 8, 2022

The fact that it should be legal for ConjunctionDISI to return NO_MORE_DOCS when the lead iterator advances to NO_MORE_DOCS without advancing other iterators makes me wonder if this change is hiding another bug. Do you understand why it's important for clauses under the ConjuctionDISI to advance to NO_MORE_DOCS?

}
for (BitSetIterator iterator : bitSetIterators) {
iterator.setDocId(NO_MORE_DOCS);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if statement makes sense to me, but I'm curious how you managed to hit this case. This suggests that we create BitSets whose size is not maxDoc, do you know where this happens?

The for loop should be unnecessary, there is no guarantee that all sub iterators advance to NO_MORE_DOCS. If this causes problems, then it means we have another bug somewhere else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if statement makes sense to me, but I'm curious how you managed to hit this case. This suggests that we create BitSets whose size is not maxDoc, do you know where this happens?

I think I might be misunderstanding the question. Each bitsetiterator could have a different length of bitset, potentially as an optimization (minLength I think suggests this is expected). If a bitsetiterator's top match is 10 and there are 1M docs in the index, I think there was no reason to store 1M bits - the bitsetiterator can just exhaust after 10.

The for loop should be unnecessary, there is no guarantee that all sub iterators advance to NO_MORE_DOCS. If this causes problems, then it means we have another bug somewhere else?

Agree this is probably unnecessary. I added it to ensure that this statement holds: "Requires that all of its sub-iterators must be on the same document all the time."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. For this, we created a DocIdSetIterator using DocIdSetBuilder. For maxDoc, it was less than the number of docs in the index. This led to the length of the bitsets being different.

@jmazanec15
Copy link
Contributor Author

The fact that it should be legal for ConjunctionDISI to return NO_MORE_DOCS when the lead iterator advances to NO_MORE_DOCS without advancing other iterators makes me wonder if this change is hiding another bug.

Is it valid for an iterator to advance outside of the ConjunctionDISI? I cant find anywhere that prevents this, but I was under the assumption it should be invalid to do this.

Do you understand why it's important for clauses under the ConjuctionDISI to advance to NO_MORE_DOCS?

In general I was trying to keep all sub-iterators on the same doc all the time. In terms of importance, I guess it depends if the iterators can be used outside the ConjunctionDISI.

@jmazanec15 jmazanec15 requested a review from jpountz September 8, 2022 23:20
@jpountz
Copy link
Contributor

jpountz commented Sep 9, 2022

Thank you for your comments, I think I understand the bug now. I think a better description of the bug is that BitSetConjunctionDISI#docID() doesn't honor its contract that it must return NO_MORE_DOCS when the iterator is exhausted. And this issue may only occur if any of the bitsets involved in the conjunction have a length that is less than maxDoc, which is not typical. This would explain why we haven't seen this bug earlier.

Is it valid for an iterator to advance outside of the ConjunctionDISI? I cant find anywhere that prevents this, but I was under the assumption it should be invalid to do this.

It is invalid indeed. The goal of the assertion that all iterators are on the same doc is to catch such issues.

I was trying to keep all sub-iterators on the same doc all the time.

I have a preference for not advancing other iterators, because it should not be necessary, and if it is then this means we have another bug somewhere else. The ConjunctionDISI that doesn't optimize for bitsets only advances other iterators when reaching NO_MORE_DOCS because avoiding to do this would require introducing more conditionals, which would have overhead. But it isn't necessary.

I would suggest to no longer advance other iterators, and change the test to make sure that docID() returns NO_MORE_DOCS when the iterator is exhausted, instead of checking if all sub iterators are on the same doc ID?

Removes updating of sub-iterators on BitSetConjDISI exhaust. Because it
is invalid to use a sub-iterator outside of the the ConjDISI, it does
not make sense to update these iterators when the lead has exhausted.
Updates test and CHANGES description to reflect this change.

Signed-off-by: John Mazanec <[email protected]>
@jmazanec15
Copy link
Contributor Author

@jpountz That makes sense. If the iterators are not valid outside of the ConjDISI, it doesnt make sense to advance them once the leader is exhausted. I updated test and changes to reflect this.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change and the test look great. Thank you!

@jpountz jpountz merged commit 0587844 into apache:main Sep 15, 2022
@jpountz jpountz changed the title LUCENE-10674: Update subiterators when BitSetConjDISI exhausts LUCENE-10674: Ensure BitSetConjDISI returns NO_MORE_DOCS when sub-iterator exhausts Sep 15, 2022
jpountz pushed a commit that referenced this pull request Sep 15, 2022
jpountz pushed a commit that referenced this pull request Sep 16, 2022
@jpountz jpountz added this to the 9.4.0 milestone Sep 16, 2022
@jpountz
Copy link
Contributor

jpountz commented Sep 16, 2022

@jmazanec15 FYI I backported this change to 9.4, so it should be in the next release.

@jpountz
Copy link
Contributor

jpountz commented Sep 16, 2022

@jmazanec15 Maybe one thing I'd like to double check: what is the query that created a DocIdSetBuilder with a length that is less than maxDoc? Is it a Lucene Query or a custom query of your own?

wjp719 added a commit to wjp719/lucene that referenced this pull request Sep 21, 2022
* main: (324 commits)
  Mute TestKnnVectorQuery#testFilterWithSameScore while we work on a fix
  Guard FieldExistsQuery against null pointers (apache#11794)
  Fix handling of ghost fields in string sorts. (apache#11792)
  LUCENE-10365 Wizard changes contributed from Solr (apache#591)
  GitHub Workflows security hardening (apache#11789)
  Improve tessellator performance by delaying calls to the method #isIntersectingPolygon (apache#11786)
  update DOAP and releaseWizard to reflect migration to github (apache#11747)
  Diversity check bugfix (apache#11781)
  Fix rare bug in TestKnnVectorQuery when we have multiple segments
  GITHUB#11778: Add detailed part-of-speech tag for particle and ending on Nori (apache#11779)
  LUCENE-10674: Move changes entry to 9.4.
  apacheGH-11172: remove WindowsDirectory and native subproject. (apache#11774)
  LUCENE-10674: Ensure BitSetConjDISI returns NO_MORE_DOCS when sub-iterator exhausts. (apache#1068)
  Removed duplicate check in SpanGradientFormatter (apache#11762)
  Fix integer overflow in tests.
  GITHUB#11742: MatchingFacetSetsCounts#getTopChildren now returns top children instead of all children (apache#11764)
  Retry gradle wrapper download on http 500 and 503. (apache#11766)
  Fix a typo affecting Luke (apache#11763)
  Fix IntervalBuilder.NO_INTERVALS docId when unpositioned (apache#11760)
  LUCENE-10592 Better estimate memory for HNSW graph (apache#11743)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants