Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a postCollect hook to LeafCollector #12375

Closed
jpountz opened this issue Jun 15, 2023 · 2 comments · Fixed by #12380
Closed

Add a postCollect hook to LeafCollector #12375

jpountz opened this issue Jun 15, 2023 · 2 comments · Fixed by #12380

Comments

@jpountz
Copy link
Contributor

jpountz commented Jun 15, 2023

Description

It is a common need to run some logic after a segment has been collected. Even though, I can't find previous instances of this discussion I'm pretty sure that this has been raised several times in the past, and the answer was essentially that this logic can easily be implemented on top of Lucene. One good example of this is our own FacetsCollector, which collects the set of matching docs per segment: getLeafCollector appends the set of doc IDs that were collected on the previous segment to the set, and getMatchingDocs takes care of the last segment, since getLeafCollector doesn't get called anymore after the last segment has been collected.

However, this approach is not perfect. If you are leveraging Lucene's concurrent search capabilities, this forces the post collection logic to run in the current thread for at least one segment per slice, instead of using the executor. This is a missed opportunity for search concurrency, since post collection logic is not always cheap. For instance, in the case of FacetsCollector it needs to run DocIdSetBuilder.build() which may need to sort a large array of doc IDs. Having a LeafCollector.postCollect() API or something along these lines would help address this issue, as postCollect() would get called on the IndexSearcher's executor.

I looked at our collectors to get a sense of how many of our collectors could take advantage of a postCollect() hook and found the following ones:

  • org.apache.lucene.facet.FacetsCollector
  • org.apache.lucene.search.grouping.BlockGroupingCollector
  • org.apache.lucene.search.grouping.TermGroupFacetCollector
  • org.apache.lucene.search.suggest.document.TopSuggestDocsCollector
  • org.apache.lucene.search.CachingCollector
@msokolov
Copy link
Contributor

+1 to this. We've had to implement a finish() in some custom collectors for handling grouping IIRC it makes sense to make it standard

jpountz added a commit to jpountz/lucene that referenced this issue Jun 21, 2023
This adds `LeafCollector#finish` as a per-segment post-collection hook. While
it was already possible to do this sort of things on top of the collector API
before, a downside is that the last leaf would need to be post-collected in the
current thread instead of using the executor, which is a missed opportunity for
making queries concurrent.

Closes apache#12375
@gsmiller
Copy link
Contributor

+1. I'm also in favor. I like being able to do this in-thread for concurrent search.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants