You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
It is a common need to run some logic after a segment has been collected. Even though, I can't find previous instances of this discussion I'm pretty sure that this has been raised several times in the past, and the answer was essentially that this logic can easily be implemented on top of Lucene. One good example of this is our own FacetsCollector, which collects the set of matching docs per segment: getLeafCollector appends the set of doc IDs that were collected on the previous segment to the set, and getMatchingDocs takes care of the last segment, since getLeafCollector doesn't get called anymore after the last segment has been collected.
However, this approach is not perfect. If you are leveraging Lucene's concurrent search capabilities, this forces the post collection logic to run in the current thread for at least one segment per slice, instead of using the executor. This is a missed opportunity for search concurrency, since post collection logic is not always cheap. For instance, in the case of FacetsCollector it needs to run DocIdSetBuilder.build() which may need to sort a large array of doc IDs. Having a LeafCollector.postCollect() API or something along these lines would help address this issue, as postCollect() would get called on the IndexSearcher's executor.
I looked at our collectors to get a sense of how many of our collectors could take advantage of a postCollect() hook and found the following ones:
This adds `LeafCollector#finish` as a per-segment post-collection hook. While
it was already possible to do this sort of things on top of the collector API
before, a downside is that the last leaf would need to be post-collected in the
current thread instead of using the executor, which is a missed opportunity for
making queries concurrent.
Closesapache#12375
Description
It is a common need to run some logic after a segment has been collected. Even though, I can't find previous instances of this discussion I'm pretty sure that this has been raised several times in the past, and the answer was essentially that this logic can easily be implemented on top of Lucene. One good example of this is our own
FacetsCollector
, which collects the set of matching docs per segment:getLeafCollector
appends the set of doc IDs that were collected on the previous segment to the set, andgetMatchingDocs
takes care of the last segment, sincegetLeafCollector
doesn't get called anymore after the last segment has been collected.However, this approach is not perfect. If you are leveraging Lucene's concurrent search capabilities, this forces the post collection logic to run in the current thread for at least one segment per slice, instead of using the executor. This is a missed opportunity for search concurrency, since post collection logic is not always cheap. For instance, in the case of
FacetsCollector
it needs to runDocIdSetBuilder.build()
which may need to sort a large array of doc IDs. Having aLeafCollector.postCollect()
API or something along these lines would help address this issue, aspostCollect()
would get called on theIndexSearcher
'sexecutor
.I looked at our collectors to get a sense of how many of our collectors could take advantage of a
postCollect()
hook and found the following ones:org.apache.lucene.facet.FacetsCollector
org.apache.lucene.search.grouping.BlockGroupingCollector
org.apache.lucene.search.grouping.TermGroupFacetCollector
org.apache.lucene.search.suggest.document.TopSuggestDocsCollector
org.apache.lucene.search.CachingCollector
The text was updated successfully, but these errors were encountered: