-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement parallel execution of sub-queries for hybrid search #749
Implement parallel execution of sub-queries for hybrid search #749
Conversation
53b9260
to
04ab2ac
Compare
f7e0ef0
to
2750ad4
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## feature/parallelize-hybrid-search #749 +/- ##
====================================================================
Coverage ? 84.89%
Complexity ? 812
====================================================================
Files ? 65
Lines ? 2490
Branches ? 410
====================================================================
Hits ? 2114
Misses ? 213
Partials ? 163 ☔ View full report in Codecov by Sentry. |
2750ad4
to
d1e4789
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked PR at high level and added comments. Please resolve them after that I will do 1 more review of the PR.
src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutorCollectorManager.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutorCollector.java
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/query/HybridQuery.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/query/HybridQueryRewriteCollectorManager.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/query/HybridQuery.java
Outdated
Show resolved
Hide resolved
perform better. For hybrid query we need to track progress of re-write for all sub-queries */ | ||
|
||
boolean actuallyRewritten = rewrittenQuery != query; | ||
return new AbstractMap.SimpleEntry(rewrittenQuery, actuallyRewritten); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we avoid storing queries in a map? and who is going to store this simple map entry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HybridQueryExecutorCollector.collect() will store this entry. If we don't store map, we can't assemble back to caller once the execution is completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see if you can use optional instead. That will simplify the data structure and make it more readable
return rewrittenQuery != query ? Optional.of(rewrittenQuery) : Optional.empty()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have data or can we run benchmark to see how much overhead (if any) adds the Optional
approach? I agree this way code looks simpler, but if we're aiming for performance improvement and this is critical section then we may want to save CPU cycles of wrapping/unwrapping result to Optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and +1 to Navneet's point regarding map and specifically using Query as a key. I was using similar approach in initial implementation of hybrid query, problem is that for some queries hashCode
can be slow.
How much time this will be called - one time per sub-query x num of shards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
understood, I mismatch the Map.Entry<Query, Boolean> with Map<Query, Boolean>, I guess the Map.Entry is used to operate on pair of objects. I'm good now, thanks Vijay
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used optional to avoid null condition, but we need query irrespective of whether query is re-written or not.
We can explicitly check for null, my point was performance. Doing first Optional.of(rewrittenQuery)
, and later optional.get()
both will add some cycles of wrapping query object into Optional and then unwrapping it. If we have done benchmarks with Optional and without it then we can use it as datapoint to make a decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I will run benchmark with and without before merging into main. In the meantime, can we keep Optional? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm not sure what is the delta in performance if any, let's keep it with Optional, run benchmark and act per per data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the intent was never to have an Optional as a member variable, the intent was to only return optional to indicate the value is not rewritten. It got misunderstood.
I realize that the query is needed irrespective of if its rewritten so Optional cannot be used. An Entry
is a good option here.
In terms of performance, if Optional does turn out to be an issue then we might have to deep dive and see if wrapping results in an object is an issue overall
perform better. For hybrid query we need to track progress of re-write for all sub-queries */ | ||
actuallyRewritten |= rewrittenSub != subQuery; | ||
rewrittenSubQueries.add(rewrittenSub); | ||
final HybridQueryExecutorCollector<IndexSearcher, Map.Entry<Query, Boolean>> collector = manager.newCollector(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my understanding, why is a collector chosen here for collecting results? you could have plugged in TaskExecutor here and then simply use the list of results returned by invokeAll. The entire collector seems to be a round about way to get the results. Manager would have been useful if you wanted to return a different collector based on certain conditions IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. The main idea is to move logic of collecting and reducing/merging into its own manager. By introducing TaskExecutor inside Scorer/Weight we are coupling how to parallelize with what needs to be parallelized. With introduction of Collector and Collector Manager, we decoupled parallelization from logic. IMO, this abstracts parallelization implementation from all of its caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets dissect this a little more. There are 3 classes here:
- HybridQueryExecutor: responsible for initializing and holding a common threadpool for query, scorer and weight,
- HybridQueryExecutorCollectorManager: Responsible for giving a new collector
- HybridQueryExecutorCollector : Holds the query results
With introduction of Collector and Collector Manager, we decoupled parallelization from logic. IMO, this abstracts parallelization implementation from all of its caller.
HybridQueryExecutor seems to be a good wrapper to make sure the same task executor is used. I am more curious about the additional value of HybridQueryExecutorCollectorManager
and HybridQueryExecutorCollector
Why can't it be
final List<Callable<Entry<Query, Boolean>>> queryRewriteTasks = new ArrayList<>();
// for each subquery
queryRewriteTasks.add(() -> rewriteQuery(subQuery); //rewrite needs to return entry<query, boolean>;
List<Entry<Query, Boolean>> rewrittenQueries = HybridQueryExecutor.getExecutor().invokeAll(queryRewriteTasks);
// rest of the logic
This way you don't have to worry about the threadsafety of the collectors. There are some utility methods in collector manager which arent related to managing a collector but more of related to rewritten queries itself.
Overall while this works fine, I want to make sure these interfaces have a clear definition and its clear on how to use them. Currently it seems like we can simplify this by not having it unless I am misunderstanding the value it brings
src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java
Show resolved
Hide resolved
* Query phase to parallelize sub query's action to improve latency | ||
*/ | ||
@RequiredArgsConstructor(staticName = "newCollector") | ||
public final class HybridQueryExecutorCollector<I, R> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is HybridQueryExecutorCollector
thread safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no setters and only way to update result is by calling collect method. Do you have any particular concerns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since collect()
could set result
, if collect()
and getResults()
are called asynchronously, technically, there could be race condition, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Currently getResult() is called from collector manager, where the contract is that it can be called only after all collectors are finished collection. This was taken care in the implementation. I can add this note in collector as well. i am reluctant to add synchronize to result variable, since it will add additional latency for scenario that is not possible at this moment. Do you have any other suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a comment is a good intention. you won't be able to enforce it. You are relying on the hope that the code doesn't change in future to hit that race condition so there will always be a risk.
If you really want to enforce; your options here would be to synchronize or move away from result collection
to returning results
as pointed out in this thread
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am reluctant to add synchronize to result variable, since it will add additional latency for scenario that is not possible at this moment
Actually, the overhead of synchronization is way less than we thought, here are some synchronization benchmark articles I found,
- https://baptiste-wicht.com/posts/2010/09/java-synchronization-mutual-exclusion-benchmark.html
- https://isuru-perera.blogspot.com/2016/05/benchmarking-java-locks-with-counters.html
One operation of "synchronized" is < 1 micro second, considering the time spending on actual query logic with intensive I/O, this is negligible and you have simple implementation with concurrent situation covered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to @VijayanB comment. I think the purpose of this whole PR is to reduce latency. Therefore, I think @VijayanB it would good if you compare the benchmark results, if you see there is little degradation also in the latency, then we should take a call on this tradeoff whether to keep synchronized or not.
src/main/java/org/opensearch/neuralsearch/query/HybridQueryRewriteCollectorManager.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/executors/HybridQueryExecutor.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/neuralsearch/query/HybridQueryRewriteCollectorManager.java
Outdated
Show resolved
Hide resolved
perform better. For hybrid query we need to track progress of re-write for all sub-queries */ | ||
|
||
boolean actuallyRewritten = rewrittenQuery != query; | ||
return new AbstractMap.SimpleEntry(rewrittenQuery, actuallyRewritten); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have data or can we run benchmark to see how much overhead (if any) adds the Optional
approach? I agree this way code looks simpler, but if we're aiming for performance improvement and this is critical section then we may want to save CPU cycles of wrapping/unwrapping result to Optional.
perform better. For hybrid query we need to track progress of re-write for all sub-queries */ | ||
|
||
boolean actuallyRewritten = rewrittenQuery != query; | ||
return new AbstractMap.SimpleEntry(rewrittenQuery, actuallyRewritten); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and +1 to Navneet's point regarding map and specifically using Query as a key. I was using similar approach in initial implementation of hybrid query, problem is that for some queries hashCode
can be slow.
How much time this will be called - one time per sub-query x num of shards?
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
Signed-off-by: Vijayan Balasubramanian <[email protected]>
bf1dc5d
to
6d8c080
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thank you Vijay.
Please plan to add unit tests for the new classes from the executor package. We can do it in a separate PR.
Signed-off-by: Vijayan Balasubramanian <[email protected]>
d889f37
into
opensearch-project:feature/parallelize-hybrid-search
…arch-project#749) Implement parallel execution of sub-queries for hybrid search Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]>
…arch-project#749) Implement parallel execution of sub-queries for hybrid search Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]>
* Implement parallel execution of sub-queries for hybrid search (#749) Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]>
* Implement parallel execution of sub-queries for hybrid search (#749) Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]> (cherry picked from commit 76090de)
…arch-project#781) * Implement parallel execution of sub-queries for hybrid search (opensearch-project#749) Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]>
… search (#749) (#786) * Implement parallel execution of sub-queries for hybrid search (#781) * Implement parallel execution of sub-queries for hybrid search (#749) Add new thread pool to schedule tasks that are related to hybrid query execution Register executor builders with Plugin Use Lucene's Task Executor to execute and collect results Parallelize Query re-write Parallelize score supplier creation Parallelize build hybrid scores Signed-off-by: Vijayan Balasubramanian <[email protected]> * Update package name in 2.15 which is different from main Signed-off-by: Vijayan Balasubramanian <[email protected]> --------- Signed-off-by: Vijayan Balasubramanian <[email protected]>
Description
Issues Resolved
#279
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.