[RFC] Introduce pluggable QueryCollectorContexts and CollectorManagers #13978

jed326 · 2024-06-04T22:23:36Z

TL;DR

I am proposing a way to separate out query preprocessing from query path execution in the existing SearchPlugin's QueryPhaseSearcher extension points. This will allow multiple SearchPlugins to provide Collector/CollectorManager related customization which is a limitation of the QueryPhaseSearcher today.

Background Information

Today the only extension point to plug in any search customization in the query phase is the QueryPhaseSearcher interface defined in the SearchPlugin. This requires any plugins that want to add customization into the query phase or even introduce additional Collectors implement an entire new QueryPhaseSearcher, and only 1 plugin's QueryPhaseSearcher can be used at a time.

Specifically, the main entry point is the searchWith method

OpenSearch/server/src/main/java/org/opensearch/search/query/QueryPhaseSearcher.java

Lines 39 to 46 in 293905a

    
           boolean searchWith( 
        
               SearchContext searchContext, 
        
               ContextIndexSearcher searcher, 
        
               Query query, 
        
               LinkedList<QueryCollectorContext> collectors, 
        
               boolean hasFilterCollector, 
        
               boolean hasTimeout 
        
           ) throws IOException;

which allows an implementation to [1] customize how the search path is executed using the ContextIndexSearcher (for example the ConcurrentQueryPhaseSearcher is used to enable concurrent search) as well as [2] perform any preprocessing or modifications on the Query, SearchContext, or even the QueryCollectorContexts linked list. However, because of the broad possibility of implementations here, today we are limited to a single QueryPhaseSearcher implementation per cluster. In this past this limitation created a conflict between the ConcurrentQueryPhaseSearcher and the HybridQueryPhaseSearcher which had to be resolved by making the latter extend the (wrapper class of) the former: opensearch-project/neural-search#356

In the first case [1] where a QueryPhaseSearcher is used to modify the search execution path there is not much we can do to support multiple QueryPhaseSearchers as ultimately we need to decide on only a single execution path to perform. For example, we can’t use the DefaultQueryPhaseSearcher and the ConcurrentQueryPhaseSearcher together because we can’t perform both concurrent and non-concurrent search (or at least it doesn’t make sense to).

However, for the second case [2] such as the HybridQueryPhaseSearcher where the searchWith extension point is not used to modify the search execution path but instead modify the Query on which to perform search, multiple preprocessing implementations should be able to be composed together.

Additionally the aggregationProcessor method

OpenSearch/server/src/main/java/org/opensearch/search/query/QueryPhaseSearcher.java

Lines 53 to 55 in 293905a

    
           default AggregationProcessor aggregationProcessor(SearchContext searchContext) { 
        
               return new DefaultAggregationProcessor(); 
        
           }

is used to setup and post process aggregation related collectors however, today it is also the only place which can be used to introduce custom CollectorManager implementations. For example, the HybridAggregationProcessor uses this extension point to introduce the HybridCollectorManager without making any modifications to the default AggregationProcessor. Since the AggregationProcessor is introduced in the same QueryPhaseSearcher interface, the same limitation exists today where only a single AggregationProcessor can be used per cluster. However, in cases like this where a plugin strictly wants to add additional CollectorManagers to the queryCollectorManagers map there are no conflicts across plugin implementations so we should be able to support an additional extension point here to allow SearchPlugins to provide custom CollectorManagers.

Describe the solution you'd like

At a high level I am proposing to decouple the query preprocessing logic (performing any processing or modifications with Query, SearchContext, and QueryCollectorContexts) from the search execution path logic. Each of the Query, SearchContext, and QueryCollectorContexts could have their own distinct pluggable extension points that allow multiple plugins to provide implementations. Each of these components deserve their own separate RFC to discuss how to best proceed so this issue will focus on making QueryCollectorContext and associated CollectorManagers pluggable.

Today a LinkedList of QueryCollectorContexts is created in QueryPhase::executeInternal with various default QueryCollectorContexts based on the query parameters (such as terminate_after, min_score, etc.). It is subsequently passed to the previously mentioned QueryPhaseSearcher::searchWith method where any plugin customization can happen and this is also where it is usually transformed from the QueryCollectorContexts into the Collector tree and passed to IndexSearcher::search to perform the query phase.

I am proposing to introduce a QueryCollectorContextProvider construct which can take pluggable QueryCollectorContext constructors to override default contexts.

I see 2 use cases for this today:

As a performance optimization for the hybrid query some custom logic was introduced to allow excluding the TopDocsCollector. This was done by adding an additional SearchWithCollector method which is pretty fragile and still tightly coupled to the QueryPhaseSearcher. This could be done instead by overriding the default TopDocsQueryCollectorContext with an empty context based on whatever condition checks need to be performed. Ref: Refactor implementations of query phase searcher, add empty QueryCollectorContext #13481.
As described in [RFC] Search performance on warm index #13806, one of the warm index search performance optimizations is to perform prefetch on the remote store backed indices and a key component of this may be injecting prefetch logic into the Collector tree. This could involve taking the Aggregator tree and wrapping it in a PrefetchableCollector type of construct to capture the doc IDs on which to perform prefetch related actions. This could be accomplished by overriding the default search_multi QueryCollectorContext via this new extension point.

Moreover, I am also proposing to add a plugin hook to allow SearchPlugins to add implementations directly to the queryCollectorManagers map. Although custom CollectorManagers could be plugged in with the above proposal of overriding the QueryCollectorContext implementation, this would still have the same limitation of only 1 plugin implementation at a time. CollectorManagers do not need to be bound by this restriction and we can open up an extension point to allow as many pluggable CollectorManagers as needed.

I also see 2 use cases for this today:

As previously mentioned, the HybridAggregationProcessor simply adds a HybridCollectorManager to the queryCollectorManagers map however today is has to implement a whole AggregationProcessor just to do so despite not making any changes to the default AggregationCollectorManagers.
Some users may want a way to plug in custom Lucene Collector implementations: https://opensearch.slack.com/archives/C051JEH8MNU/p1715869554173169

Describe alternatives you've considered

Instead of an extension point for overriding default QueryCollectorContexts as well as an extension point for the queryCollectorManagers map, we could instead introduce a notion of priority to QueryCollectorContext and allow plugins to provide their own QueryCollectorContext implementations with priority, similar to how ActionFilters work today. All QueryCollectorContext implementations would then be chained together based on their priority when transforming the QueryCollectorContext into the Collector tree. Rough POC: main...jed326:OpenSearch:ordered-context

This allows plugins to provide an uncapped number of QueryCollectorContext implementations, however since the order of the contexts will determine the structure of the Collector tree this means the context priority will be tightly coupled with it’s functionality and each QueryCollectorContext would likely need to have many instanceof checks to ensure it is performing its intended function, similar to how BucketCollectorProcessor is implemented today:

OpenSearch/server/src/main/java/org/opensearch/search/aggregations/BucketCollectorProcessor.java

Lines 60 to 86 in 618782d

    
           public void processPostCollection(Collector collectorTree) throws IOException { 
        
               final Queue<Collector> collectors = new LinkedList<>(); 
        
               collectors.offer(collectorTree); 
        
               while (!collectors.isEmpty()) { 
        
                   Collector currentCollector = collectors.poll(); 
        
                   if (currentCollector instanceof InternalProfileCollector) { 
        
                       collectors.offer(((InternalProfileCollector) currentCollector).getCollector()); 
        
                   } else if (currentCollector instanceof MinimumScoreCollector) { 
        
                       collectors.offer(((MinimumScoreCollector) currentCollector).getCollector()); 
        
                   } else if (currentCollector instanceof MultiCollector) { 
        
                       for (Collector innerCollector : ((MultiCollector) currentCollector).getCollectors()) { 
        
                           collectors.offer(innerCollector); 
        
                       } 
        
                   } else if (currentCollector instanceof BucketCollector) { 
        
                       ((BucketCollector) currentCollector).postCollection(); 
        
                       // Perform build aggregation during post collection 
        
                       if (currentCollector instanceof Aggregator) { 
        
                           ((Aggregator) currentCollector).buildTopLevel(); 
        
                       } else if (currentCollector instanceof MultiBucketCollector) { 
        
                           for (Collector innerCollector : ((MultiBucketCollector) currentCollector).getCollectors()) { 
        
                               collectors.offer(innerCollector); 
        
                           } 
        
                       } 
        
                   } 
        
               } 
        
           }

The text was updated successfully, but these errors were encountered:

jed326 · 2024-06-04T22:34:15Z

Tagging some individuals for thoughts: @sohami @msfroh @mch2 @kasundra07 @reta

Also, I believe this should at least partly address the concerns raised in #13320, tagging @conggguan and @martin-gaievski for some thoughts on this as well.

peternied · 2024-06-05T15:24:38Z

[Triage - attendees 1 2 3 4 5 6 7]
@jed326 Thanks for creating this RFC, look forward to seeing how this pans out.

jed326 added enhancement Enhancement or improvement to existing feature or request untriaged Plugins Search:Aggregations Search Search query, autocomplete ...etc labels Jun 4, 2024

github-project-automation bot added this to Search Project Board and OpenSearch Roadmap Jun 4, 2024

github-project-automation bot moved this to 🆕 New in Search Project Board Jun 4, 2024

github-project-automation bot moved this to Planned work items in OpenSearch Roadmap Jun 4, 2024

peternied added RFC Issues requesting major changes and removed untriaged labels Jun 5, 2024

jed326 added this to Tiered Writable Index Jun 5, 2024

jed326 self-assigned this Jun 5, 2024

jed326 mentioned this issue Jun 26, 2024

[Feature Request] Allow override individual sub phase of Fetch phase #14546

Open

andrross added the Roadmap:Search Project-wide roadmap label label Aug 6, 2024

getsaurabh02 moved this from 🆕 New to Later (6 months plus) in Search Project Board Aug 15, 2024

Gankris96 mentioned this issue Aug 28, 2024

Pluggable deciders for concurrent search #15363

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Introduce pluggable QueryCollectorContexts and CollectorManagers #13978

[RFC] Introduce pluggable QueryCollectorContexts and CollectorManagers #13978

jed326 commented Jun 4, 2024 •

edited

Loading

jed326 commented Jun 4, 2024

peternied commented Jun 5, 2024

[RFC] Introduce pluggable QueryCollectorContexts and CollectorManagers #13978

[RFC] Introduce pluggable QueryCollectorContexts and CollectorManagers #13978

Comments

jed326 commented Jun 4, 2024 • edited Loading

TL;DR

Background Information

Describe the solution you'd like

Describe alternatives you've considered

jed326 commented Jun 4, 2024

peternied commented Jun 5, 2024

jed326 commented Jun 4, 2024 •

edited

Loading