Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Introduce pluggable QueryCollectorContexts and CollectorManagers #13978

Open
jed326 opened this issue Jun 4, 2024 · 2 comments
Open

[RFC] Introduce pluggable QueryCollectorContexts and CollectorManagers #13978

jed326 opened this issue Jun 4, 2024 · 2 comments
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Plugins RFC Issues requesting major changes Roadmap:Search Project-wide roadmap label Search:Aggregations Search Search query, autocomplete ...etc

Comments

@jed326
Copy link
Collaborator

jed326 commented Jun 4, 2024

TL;DR

I am proposing a way to separate out query preprocessing from query path execution in the existing SearchPlugin's QueryPhaseSearcher extension points. This will allow multiple SearchPlugins to provide Collector/CollectorManager related customization which is a limitation of the QueryPhaseSearcher today.

Background Information

Today the only extension point to plug in any search customization in the query phase is the QueryPhaseSearcher interface defined in the SearchPlugin. This requires any plugins that want to add customization into the query phase or even introduce additional Collectors implement an entire new QueryPhaseSearcher, and only 1 plugin's QueryPhaseSearcher can be used at a time.

Specifically, the main entry point is the searchWith method

boolean searchWith(
SearchContext searchContext,
ContextIndexSearcher searcher,
Query query,
LinkedList<QueryCollectorContext> collectors,
boolean hasFilterCollector,
boolean hasTimeout
) throws IOException;

which allows an implementation to [1] customize how the search path is executed using the ContextIndexSearcher (for example the ConcurrentQueryPhaseSearcher is used to enable concurrent search) as well as [2] perform any preprocessing or modifications on the Query, SearchContext, or even the QueryCollectorContexts linked list. However, because of the broad possibility of implementations here, today we are limited to a single QueryPhaseSearcher implementation per cluster. In this past this limitation created a conflict between the ConcurrentQueryPhaseSearcher and the HybridQueryPhaseSearcher which had to be resolved by making the latter extend the (wrapper class of) the former: opensearch-project/neural-search#356

In the first case [1] where a QueryPhaseSearcher is used to modify the search execution path there is not much we can do to support multiple QueryPhaseSearchers as ultimately we need to decide on only a single execution path to perform. For example, we can’t use the DefaultQueryPhaseSearcher and the ConcurrentQueryPhaseSearcher together because we can’t perform both concurrent and non-concurrent search (or at least it doesn’t make sense to).

However, for the second case [2] such as the HybridQueryPhaseSearcher where the searchWith extension point is not used to modify the search execution path but instead modify the Query on which to perform search, multiple preprocessing implementations should be able to be composed together.

Additionally the aggregationProcessor method

default AggregationProcessor aggregationProcessor(SearchContext searchContext) {
return new DefaultAggregationProcessor();
}

is used to setup and post process aggregation related collectors however, today it is also the only place which can be used to introduce custom CollectorManager implementations. For example, the HybridAggregationProcessor uses this extension point to introduce the HybridCollectorManager without making any modifications to the default AggregationProcessor. Since the AggregationProcessor is introduced in the same QueryPhaseSearcher interface, the same limitation exists today where only a single AggregationProcessor can be used per cluster. However, in cases like this where a plugin strictly wants to add additional CollectorManagers to the queryCollectorManagers map there are no conflicts across plugin implementations so we should be able to support an additional extension point here to allow SearchPlugins to provide custom CollectorManagers.

Describe the solution you'd like

At a high level I am proposing to decouple the query preprocessing logic (performing any processing or modifications with Query, SearchContext, and QueryCollectorContexts) from the search execution path logic. Each of the Query, SearchContext, and QueryCollectorContexts could have their own distinct pluggable extension points that allow multiple plugins to provide implementations. Each of these components deserve their own separate RFC to discuss how to best proceed so this issue will focus on making QueryCollectorContext and associated CollectorManagers pluggable.

Today a LinkedList of QueryCollectorContexts is created in QueryPhase::executeInternal with various default QueryCollectorContexts based on the query parameters (such as terminate_after, min_score, etc.). It is subsequently passed to the previously mentioned QueryPhaseSearcher::searchWith method where any plugin customization can happen and this is also where it is usually transformed from the QueryCollectorContexts into the Collector tree and passed to IndexSearcher::search to perform the query phase.

I am proposing to introduce a QueryCollectorContextProvider construct which can take pluggable QueryCollectorContext constructors to override default contexts.

I see 2 use cases for this today:

  1. As a performance optimization for the hybrid query some custom logic was introduced to allow excluding the TopDocsCollector. This was done by adding an additional SearchWithCollector method which is pretty fragile and still tightly coupled to the QueryPhaseSearcher. This could be done instead by overriding the default TopDocsQueryCollectorContext with an empty context based on whatever condition checks need to be performed. Ref: Refactor implementations of query phase searcher, add empty QueryCollectorContext #13481.
  2. As described in [RFC] Search performance on warm index #13806, one of the warm index search performance optimizations is to perform prefetch on the remote store backed indices and a key component of this may be injecting prefetch logic into the Collector tree. This could involve taking the Aggregator tree and wrapping it in a PrefetchableCollector type of construct to capture the doc IDs on which to perform prefetch related actions. This could be accomplished by overriding the default search_multi QueryCollectorContext via this new extension point.

Moreover, I am also proposing to add a plugin hook to allow SearchPlugins to add implementations directly to the queryCollectorManagers map. Although custom CollectorManagers could be plugged in with the above proposal of overriding the QueryCollectorContext implementation, this would still have the same limitation of only 1 plugin implementation at a time. CollectorManagers do not need to be bound by this restriction and we can open up an extension point to allow as many pluggable CollectorManagers as needed.

I also see 2 use cases for this today:

  1. As previously mentioned, the HybridAggregationProcessor simply adds a HybridCollectorManager to the queryCollectorManagers map however today is has to implement a whole AggregationProcessor just to do so despite not making any changes to the default AggregationCollectorManagers.
  2. Some users may want a way to plug in custom Lucene Collector implementations: https://opensearch.slack.com/archives/C051JEH8MNU/p1715869554173169

Describe alternatives you've considered

Instead of an extension point for overriding default QueryCollectorContexts as well as an extension point for the queryCollectorManagers map, we could instead introduce a notion of priority to QueryCollectorContext and allow plugins to provide their own QueryCollectorContext implementations with priority, similar to how ActionFilters work today. All QueryCollectorContext implementations would then be chained together based on their priority when transforming the QueryCollectorContext into the Collector tree. Rough POC: main...jed326:OpenSearch:ordered-context

This allows plugins to provide an uncapped number of QueryCollectorContext implementations, however since the order of the contexts will determine the structure of the Collector tree this means the context priority will be tightly coupled with it’s functionality and each QueryCollectorContext would likely need to have many instanceof checks to ensure it is performing its intended function, similar to how BucketCollectorProcessor is implemented today:

public void processPostCollection(Collector collectorTree) throws IOException {
final Queue<Collector> collectors = new LinkedList<>();
collectors.offer(collectorTree);
while (!collectors.isEmpty()) {
Collector currentCollector = collectors.poll();
if (currentCollector instanceof InternalProfileCollector) {
collectors.offer(((InternalProfileCollector) currentCollector).getCollector());
} else if (currentCollector instanceof MinimumScoreCollector) {
collectors.offer(((MinimumScoreCollector) currentCollector).getCollector());
} else if (currentCollector instanceof MultiCollector) {
for (Collector innerCollector : ((MultiCollector) currentCollector).getCollectors()) {
collectors.offer(innerCollector);
}
} else if (currentCollector instanceof BucketCollector) {
((BucketCollector) currentCollector).postCollection();
// Perform build aggregation during post collection
if (currentCollector instanceof Aggregator) {
((Aggregator) currentCollector).buildTopLevel();
} else if (currentCollector instanceof MultiBucketCollector) {
for (Collector innerCollector : ((MultiBucketCollector) currentCollector).getCollectors()) {
collectors.offer(innerCollector);
}
}
}
}
}

@jed326 jed326 added enhancement Enhancement or improvement to existing feature or request untriaged Plugins Search:Aggregations Search Search query, autocomplete ...etc labels Jun 4, 2024
@github-project-automation github-project-automation bot moved this to Planned work items in OpenSearch Roadmap Jun 4, 2024
@jed326
Copy link
Collaborator Author

jed326 commented Jun 4, 2024

Tagging some individuals for thoughts: @sohami @msfroh @mch2 @kasundra07 @reta

Also, I believe this should at least partly address the concerns raised in #13320, tagging @conggguan and @martin-gaievski for some thoughts on this as well.

@peternied peternied added RFC Issues requesting major changes and removed untriaged labels Jun 5, 2024
@peternied
Copy link
Member

[Triage - attendees 1 2 3 4 5 6 7]
@jed326 Thanks for creating this RFC, look forward to seeing how this pans out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Plugins RFC Issues requesting major changes Roadmap:Search Project-wide roadmap label Search:Aggregations Search Search query, autocomplete ...etc
Projects
Status: New
Status: Later (6 months plus)
Status: No status
Development

No branches or pull requests

3 participants