[FEATURE] Implement parallel execution of sub-queries for hybrid search #279

martin-gaievski · 2023-09-04T22:47:09Z

Is your feature request related to a problem?

Currently, the individual queries of a hybrid query (aka sub-queries) are executed sequentially (the initial implementation was done under #123). That may be sub-optimal, as the system may waste time waiting for one sub-query to finish before starting the next.

The obvious approach is to run each sub-query on its own thread and wait for the results from all threads; this happens at the shard level. The results are then combined and sent to the coordinator node all at once, as is done today.
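To make the idea concrete, here is a minimal sketch of the fan-out-and-wait pattern using only `java.util.concurrent`. It is not the plugin's implementation: `ParallelSubQuerySketch`, `runAll`, and the toy sub-queries returning doc ids are hypothetical names used for illustration, and the pool size is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelSubQuerySketch {

    /** Runs every sub-query task on the pool and returns the results in sub-query order. */
    static <T> List<T> runAll(ExecutorService pool, List<Callable<T>> subQueries) {
        try {
            // invokeAll blocks until all tasks have finished, successfully or not.
            List<Future<T>> futures = pool.invokeAll(subQueries);
            List<T> results = new ArrayList<>(futures.size());
            for (Future<T> future : futures) {
                results.add(future.get()); // a failed sub-query surfaces here
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for sub-queries", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("sub-query failed", e.getCause());
        }
    }

    /** Two toy sub-queries (hypothetical), each returning the doc ids it matched on this shard. */
    static List<List<Integer>> demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2); // pool size is an assumption
        try {
            List<Callable<List<Integer>>> subQueries = new ArrayList<>();
            subQueries.add(() -> List.of(1, 4, 7)); // e.g. a text match
            subQueries.add(() -> List.of(4, 9));    // e.g. a k-NN match
            return runAll(pool, subQueries);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because `invokeAll` returns futures in the same order as the submitted tasks, the combined result keeps one slot per sub-query, which is what the later combination step would rely on.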

What solution would you like?

Sub-queries may be executed in parallel. The actual improvement must be verified by benchmarking the sequential and parallel approaches.

jmazanec15 · 2024-04-29T15:26:52Z

This is not going to be ready for 2.14 (cc: @VijayanB); moving to 2.15.

VijayanB · 2024-05-08T17:09:18Z

Hybrid Query Hot Spots

While closely following the execution of a hybrid query, we found three places (listed below) where independent sub-queries are executed synchronously, one after another. We could introduce multi-threading inside our abstraction to parallelize these executions and improve query latency, without making changes to the existing API.

Query re-write

Query re-write is performed as the first step of a search to improve search quality. HybridQuery performs the re-write by calling the re-write API on each sub-query sequentially, and repeats this in a loop until no further re-write is possible. However, a query re-write does not always just re-write the existing query object. For example, the Lucene k-NN query performs the approximate search and builds TopDocs during re-write, whereas other queries simply convert themselves into an internal Query representation without performing an actual Lucene search.
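The rewrite-until-stable loop can be sketched as rounds in which every sub-query is rewritten concurrently. This is an illustration only: `Rewritable` and `Staged` are hypothetical stand-ins (not Lucene or plugin classes), and the sketch assumes the convention that `rewrite()` returns the same object once nothing is left to rewrite.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ParallelRewriteSketch {

    /** Hypothetical stand-in for a query; rewrite() returns `this` once nothing is left to do. */
    interface Rewritable {
        Rewritable rewrite();
    }

    /** Toy query that needs `remaining` more rewrite passes before it is stable. */
    static final class Staged implements Rewritable {
        final String name;
        final int remaining;

        Staged(String name, int remaining) {
            this.name = name;
            this.remaining = remaining;
        }

        @Override
        public Rewritable rewrite() {
            return remaining == 0 ? this : new Staged(name, remaining - 1);
        }

        @Override
        public String toString() {
            return name + ":" + remaining;
        }
    }

    /** Rewrites all sub-queries in rounds; each round runs the rewrites concurrently. */
    static List<Rewritable> rewriteAll(ExecutorService pool, List<Rewritable> subQueries) {
        List<Rewritable> current = subQueries;
        boolean changed = true;
        while (changed) {
            List<CompletableFuture<Rewritable>> round = new ArrayList<>();
            for (Rewritable query : current) {
                round.add(CompletableFuture.supplyAsync(query::rewrite, pool));
            }
            List<Rewritable> next = new ArrayList<>();
            changed = false;
            for (int i = 0; i < round.size(); i++) {
                Rewritable rewritten = round.get(i).join(); // wait for this sub-query
                changed |= rewritten != current.get(i);     // identity check: same object means stable
                next.add(rewritten);
            }
            current = next;
        }
        return current;
    }

    /** Rewrites two toy sub-queries that need 1 and 3 passes respectively. */
    static List<String> demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Rewritable> input = List.of(new Staged("match", 1), new Staged("knn", 3));
            List<String> out = new ArrayList<>();
            for (Rewritable query : rewriteAll(pool, input)) {
                out.add(query.toString());
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch a round is only as slow as its slowest sub-query, so an expensive rewrite (such as one that runs a search) no longer delays the start of the others.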

Create Scorer by Weight for every Segments

Once query re-write is successful, a Weight is created from the Query. The Weight is required to store the state of the Query. To perform a search with a given Weight, we need to create an instance of Scorer for every segment, which iterates over the documents in that segment and scores them. HybridQueryWeight internally iterates over the weights of its sub-queries sequentially to build HybridQueryScorer, which provides an abstraction over the list of scorers from its sub-queries. The cost of creating a Scorer instance varies and depends on the type of query. For example, before creating a scorer, TermWeight first navigates, within the list of unique terms created at index time, to the term value specified in the query, and then provides a posting list iterator to TermScorer. Similarly, before creating KNNScorer, KNNWeight performs an approximate or exact search on the given segment and provides the list of KnnQueryResult as an iterator to KNNScorer. Hence, creating a scorer from a Weight can be expensive, depending on the number of segments, the size of each segment, and the type of query.
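As a rough sketch of what parallel scorer creation for one segment could look like, the example below builds one "scorer" per sub-query concurrently and keeps the slots aligned with the sub-queries. `SubWeight` and `buildScorers` are hypothetical names, a scorer is reduced to an array of matching doc ids, and a `null` entry is assumed to mean that nothing in the segment matched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ParallelScorerSketch {

    /** Hypothetical stand-in for a sub-query weight: yields the matching doc ids for one segment. */
    interface SubWeight {
        int[] scorer(int segment); // returns null when nothing in the segment matches
    }

    /** Builds one scorer per sub-query for a segment; slot i always belongs to sub-query i. */
    static List<int[]> buildScorers(ExecutorService pool, List<SubWeight> weights, int segment) {
        List<CompletableFuture<int[]>> pending = new ArrayList<>();
        for (SubWeight weight : weights) {
            pending.add(CompletableFuture.supplyAsync(() -> weight.scorer(segment), pool));
        }
        List<int[]> scorers = new ArrayList<>();
        for (CompletableFuture<int[]> future : pending) {
            scorers.add(future.join()); // keep nulls so positions still line up
        }
        return scorers;
    }

    /** Two toy weights over two segments; the term weight matches only in segment 0. */
    static String demo() {
        SubWeight term = seg -> seg == 0 ? new int[] {2, 5} : null;
        SubWeight knn = seg -> new int[] {seg * 10 + 1};
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            StringBuilder out = new StringBuilder();
            for (int segment = 0; segment < 2; segment++) {
                for (int[] docs : buildScorers(pool, List.of(term, knn), segment)) {
                    out.append(Arrays.toString(docs)).append(' ');
                }
            }
            return out.toString().trim();
        } finally {
            pool.shutdown();
        }
    }
}
```

Keeping the empty slot rather than dropping it means later stages can still tell which sub-query each scorer came from.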

Calculate Score by Scorer for every matched documents

Once HybridQueryScorer is created, the search calls the collector manager to collect doc IDs. HybridCollectorManager calls the hybridscores method to get the list of scores from every sub-query by calling each sub-query scorer's score API. As above, the hybridscores API calls the score API of each scorer sequentially. The cost of the score method depends on the scorer. For example, the BM25Similarity scorer calculates the score when the score API is called, whereas Lucene k-NN or native k-NN just returns the score that was previously calculated during query re-write or scorer creation. However, converting the hybridscores implementation from single-threaded to multi-threaded should improve latency for queries whose score calculation is actually expensive at the time the score API is called.
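A minimal sketch of collecting per-sub-query scores for one document concurrently is shown below. `SubScorer`, the `hybridScores` helper, and both toy scorers are hypothetical and only mirror the two cost profiles described above (computed on demand versus looked up from an earlier step); whether this pays off per document is exactly what the benchmarks would need to show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ParallelHybridScoresSketch {

    /** Hypothetical stand-in for a sub-query scorer. */
    interface SubScorer {
        float score(int docId);
    }

    /** Collects one score per sub-query for docId; slot i belongs to sub-query i. */
    static float[] hybridScores(ExecutorService pool, List<SubScorer> scorers, int docId) {
        List<CompletableFuture<Float>> pending = new ArrayList<>();
        for (SubScorer scorer : scorers) {
            pending.add(CompletableFuture.supplyAsync(() -> scorer.score(docId), pool));
        }
        float[] scores = new float[pending.size()];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = pending.get(i).join(); // wait for sub-query i
        }
        return scores;
    }

    /** Scores document 3 with one on-demand scorer and one lookup-only scorer. */
    static String demo() {
        // Computes its score when asked, like a similarity-based scorer would.
        SubScorer onDemand = doc -> doc * 0.5f;
        // Only looks up a score that was computed in an earlier step.
        Map<Integer, Float> cached = Map.of(3, 0.9f);
        SubScorer precomputed = doc -> cached.getOrDefault(doc, 0.0f);

        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            return Arrays.toString(hybridScores(pool, List.of(onDemand, precomputed), 3));
        } finally {
            pool.shutdown();
        }
    }
}
```

For a lookup-only scorer the hand-off to another thread is likely to cost more than the lookup itself, so any gain would come from the scorers that do real work at score time.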

Tasks

Add new thread pool for hybrid query executor
Add parallelization to scorer
Add parallelization to Query rewrite
Add parallelization to calculate hybrid scores
Merge into main
Benchmarks
- [ ] Compare performance with and without parallelization
- [ ] Compare performance with and without concurrent segment search

getsaurabh02 · 2024-06-24T20:14:34Z

Closing this out since changes are merged

martin-gaievski added Enhancements Increases software capabilities beyond original client specifications untriaged labels Sep 4, 2023

navneet1v added backlog All the backlog features should be marked with this label and removed untriaged labels Sep 6, 2023

navneet1v added this to Vector Search RoadMap Sep 6, 2023

github-project-automation bot moved this to Backlog in Vector Search RoadMap Sep 6, 2023

navneet1v added the good first issue Good for newcomers label Sep 15, 2023

martin-gaievski mentioned this issue Sep 25, 2023

Add hybrid search blog opensearch-project/project-website#2182

Merged

1 task

vamshin moved this from Backlog to Backlog (Hot) in Vector Search RoadMap Feb 8, 2024

navneet1v removed the good first issue Good for newcomers label Feb 22, 2024

vamshin assigned VijayanB Mar 13, 2024

vamshin added the v2.14.0 label Mar 13, 2024

vamshin moved this from Backlog (Hot) to 2.14.0 in Vector Search RoadMap Apr 1, 2024

martin-gaievski mentioned this issue Apr 23, 2024

[META] Improve Hybrid query latency #704

Closed

navneet1v added the hybrid query performance optimization label Apr 24, 2024

jmazanec15 moved this from 2.14.0 to 2.15.0 in Vector Search RoadMap Apr 29, 2024

jmazanec15 added v2.15.0 and removed v2.14.0 labels Apr 29, 2024

VijayanB mentioned this issue May 15, 2024

Implement parallel execution of sub-queries for hybrid search #749

Merged

5 tasks

VijayanB mentioned this issue Jun 10, 2024

Implement parallel execution of sub-queries for hybrid search #781

Merged

5 tasks

getsaurabh02 closed this as completed Jun 24, 2024

github-project-automation bot moved this from 2.15.0 to ✅ Done in Vector Search RoadMap Jun 24, 2024

VijayanB mentioned this issue Aug 9, 2024

[Blog] Concurrent query execution in hybrid query (2.15) opensearch-project/project-website#3162

Closed
