Add a postCollect hook to LeafCollector #12375

jpountz · 2023-06-15T14:17:42Z

Description

It is a common need to run some logic after a segment has been collected. Even though I can't find previous instances of this discussion, I'm pretty sure that this has been raised several times in the past, and the answer was essentially that this logic can easily be implemented on top of Lucene. One good example of this is our own FacetsCollector, which collects the set of matching docs per segment: getLeafCollector appends the set of doc IDs that were collected on the previous segment to the set, and getMatchingDocs takes care of the last segment, since getLeafCollector doesn't get called anymore after the last segment has been collected.
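To make the workaround concrete, here is a minimal sketch of the "flush the previous segment lazily" pattern described above. It deliberately uses simplified stand-in types rather than Lucene's real classes: `SegmentCollector`, `DeferredFlushCollector` and `DeferredFlushDemo` are hypothetical names, and the sort merely stands in for whatever per-segment finalization a real collector does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a per-segment collector; NOT Lucene's LeafCollector.
interface SegmentCollector {
  void collect(int doc);
}

// Hypothetical collector mimicking the lazy-flush pattern described above.
class DeferredFlushCollector {
  private final List<int[]> matchingDocs = new ArrayList<>();
  private List<Integer> current; // doc IDs buffered for the segment being collected

  // Called once per segment; first finalizes whatever the previous segment buffered.
  SegmentCollector getLeafCollector() {
    flushCurrentSegment();
    current = new ArrayList<>();
    final List<Integer> buffer = current;
    return doc -> buffer.add(doc);
  }

  // Must handle the last segment: no further getLeafCollector() call will flush it.
  List<int[]> getMatchingDocs() {
    flushCurrentSegment();
    return matchingDocs;
  }

  private void flushCurrentSegment() {
    if (current != null) {
      // Stand-in for the per-segment post-collection work.
      int[] docs = current.stream().mapToInt(Integer::intValue).sorted().toArray();
      matchingDocs.add(docs);
      current = null;
    }
  }
}

class DeferredFlushDemo {
  static List<int[]> run() {
    DeferredFlushCollector collector = new DeferredFlushCollector();
    SegmentCollector first = collector.getLeafCollector();
    first.collect(7);
    first.collect(2);
    SegmentCollector second = collector.getLeafCollector(); // flushes segment 1
    second.collect(5);
    return collector.getMatchingDocs(); // flushes segment 2 in the caller's thread
  }

  public static void main(String[] args) {
    for (int[] docs : run()) {
      System.out.println(Arrays.toString(docs));
    }
  }
}
```

Note how the last segment is only finalized when the caller asks for the results, which is the weak spot of this pattern.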

However, this approach is not perfect. If you are leveraging Lucene's concurrent search capabilities, this forces the post-collection logic to run in the current thread for at least one segment per slice, instead of using the executor. This is a missed opportunity for search concurrency, since post-collection logic is not always cheap. For instance, in the case of FacetsCollector it needs to run DocIdSetBuilder.build(), which may need to sort a large array of doc IDs. Having a LeafCollector.postCollect() API or something along these lines would help address this issue, as postCollect() would get called on the IndexSearcher's executor.
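As a rough illustration of the idea, the sketch below models a per-segment hook that is invoked by the same task that collected the segment, so the finalization work stays on the executor. Everything here is an assumption made for the example: `HookedSegmentCollector`, `SortingSegmentCollector` and `PostCollectDemo` are hypothetical stand-ins, not Lucene classes, and a plain `ExecutorService` plays the role of the IndexSearcher's executor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simplified stand-in for a per-segment collector with the proposed hook; NOT Lucene's API.
interface HookedSegmentCollector {
  void collect(int doc);

  // Hypothetical hook, named after the proposal: called once the segment is fully collected.
  void postCollect();
}

class SortingSegmentCollector implements HookedSegmentCollector {
  private final List<Integer> buffer = new ArrayList<>();
  private int[] sortedDocs; // produced by postCollect()
  String finishedOn;        // name of the thread that ran postCollect(), kept for the demo

  @Override
  public void collect(int doc) {
    buffer.add(doc);
  }

  @Override
  public void postCollect() {
    // Stand-in for potentially expensive per-segment finalization.
    sortedDocs = buffer.stream().mapToInt(Integer::intValue).sorted().toArray();
    finishedOn = Thread.currentThread().getName();
  }

  int[] docs() {
    return sortedDocs;
  }
}

class PostCollectDemo {
  // Each inner array represents the matching docs of one segment.
  static List<SortingSegmentCollector> search(int[][] segments) {
    ExecutorService executor = Executors.newFixedThreadPool(2);
    try {
      List<SortingSegmentCollector> collectors = new ArrayList<>();
      List<Future<?>> tasks = new ArrayList<>();
      for (int[] segment : segments) {
        SortingSegmentCollector collector = new SortingSegmentCollector();
        collectors.add(collector);
        tasks.add(executor.submit(() -> {
          for (int doc : segment) {
            collector.collect(doc);
          }
          collector.postCollect(); // runs on the executor thread, not the caller's
        }));
      }
      for (Future<?> task : tasks) {
        task.get(); // wait for collection and post-collection to complete
      }
      return collectors;
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    for (SortingSegmentCollector c : search(new int[][] {{9, 3, 6}, {4, 1}})) {
      System.out.println(Arrays.toString(c.docs()) + " finalized on " + c.finishedOn);
    }
  }
}
```

With this shape, the caller's thread only waits for the tasks; no segment needs to be finalized after the search returns.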

I looked at our collectors to get a sense of how many of our collectors could take advantage of a postCollect() hook and found the following ones:

org.apache.lucene.facet.FacetsCollector
org.apache.lucene.search.grouping.BlockGroupingCollector
org.apache.lucene.search.grouping.TermGroupFacetCollector
org.apache.lucene.search.suggest.document.TopSuggestDocsCollector
org.apache.lucene.search.CachingCollector


msokolov · 2023-06-21T10:30:45Z

+1 to this. We've had to implement a finish() in some custom collectors for handling grouping, IIRC. It makes sense to make it standard.

This adds `LeafCollector#finish` as a per-segment post-collection hook. While it was already possible to do this sort of thing on top of the collector API before, a downside is that the last leaf would need to be post-collected in the current thread instead of using the executor, which is a missed opportunity for making queries concurrent. Closes apache#12375

gsmiller · 2023-06-21T12:30:08Z

+1. I'm also in favor. I like being able to do this in-thread for concurrent search.

jpountz added the type:enhancement label Jun 15, 2023

jpountz mentioned this issue Jun 21, 2023

Add a post-collection hook to LeafCollector. #12380

Merged

reta mentioned this issue Jun 28, 2023

CardinalityIT/NestedIT test failures with concurrent search enabled and AssertingCodec opensearch-project/OpenSearch#8303

Merged

4 tasks

jpountz closed this as completed in #12380 Jun 30, 2023
