Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coordinator can return partial results after the timeout when allow_partial_search_results is true #16681

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

kkewwei
Copy link
Contributor

@kkewwei kkewwei commented Nov 19, 2024

Description

In query phase, the coordinate concurrently search each shard, If any shard is blocked or responds very slowly, the coordination node will be stuck even if the timeout is set.

The pr supports timeout waiting, if the timeout is exceeded, the coordinator considers the shard as failed and gos on the fetch phase.

Related Issues

Resolves #817 (comment)

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added the enhancement Enhancement or improvement to existing feature or request label Nov 19, 2024
@kkewwei kkewwei force-pushed the return_partial_result branch from 3b9454f to 27eed03 Compare November 19, 2024 09:13
@kkewwei kkewwei changed the title opensearch should returns partial results after the timeout in coordinate node when allow_partial_search_results is true Coordinator can return partial results after the timeout when allow_partial_search_results is true Nov 19, 2024
@kkewwei kkewwei force-pushed the return_partial_result branch from 27eed03 to 61d84d1 Compare November 19, 2024 09:14
Copy link
Contributor

❌ Gradle check result for 61d84d1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

long leftTimeMills;
if (queryPhase) {
// it's costly in query phase.
leftTimeMills = task.queryPhaseTimeout() - (System.currentTimeMillis() - task.startTimeMills());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation behind the queryPhaseTimeoutPercentage concept? I think it's going to depend on the query and the setup whether query or fetch phase takes longer and it doesn't seem super intuitive for a user to understand how to use this. For example a query that matches a lot of sparse documents using searchable snapshots might spend much longer in the fetch phase while a query that performs complex aggregations might spend a lot longer in the query phase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope to reserve some time for the subsequent phase as a backup measure, to ensure each stage can be allocated a certain amount of time. Of course, if the previous stage takes a very short time, it won't affect the remaining time available for the subsequent phases either.

If no such reservation is made, and a shard is blocked in query phase and uses up all the time, even if it returns after the timeout, there won't be any executable time left for the subsequent stages, and the timeout would be meaningless in that case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm should we have separate timeouts for the coordinator and the shard level search tasks then? I still think it's pretty unintuitive to use a % like this.

Copy link
Contributor Author

@kkewwei kkewwei Nov 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jed326 , thanks for your reply, how about importing a new settings coordinator_timeout?

Copy link
Contributor

❌ Gradle check result for f2cb9f7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 3e56548: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@kkewwei kkewwei force-pushed the return_partial_result branch from 3e56548 to c55f45e Compare November 25, 2024 04:41
Copy link
Contributor

❌ Gradle check result for c55f45e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@kkewwei kkewwei force-pushed the return_partial_result branch from c55f45e to e2305c7 Compare November 25, 2024 06:03
Copy link
Contributor

❌ Gradle check result for e2305c7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@kkewwei kkewwei force-pushed the return_partial_result branch from e2305c7 to b6802ae Compare November 25, 2024 06:54
Copy link
Contributor

❌ Gradle check result for b6802ae: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@kkewwei
Copy link
Contributor Author

kkewwei commented Nov 28, 2024

❌ Gradle check result for b6802ae: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

org.opensearch.search.basic.SearchWithRandomExceptionsIT.testRandomExceptions #15828

@kkewwei kkewwei force-pushed the return_partial_result branch 2 times, most recently from 8a2e7cd to d32aac8 Compare November 29, 2024 23:43
Copy link
Contributor

❌ Gradle check result for d32aac8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Dec 5, 2024

❌ Gradle check result for 0bb55bc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…artial_search_results is true

Signed-off-by: kkewwei <[email protected]>
Signed-off-by: kkewwei <[email protected]>
@kkewwei kkewwei force-pushed the return_partial_result branch from 0bb55bc to 7e5b243 Compare December 6, 2024 03:10
Copy link
Contributor

github-actions bot commented Dec 6, 2024

❕ Gradle check result for 7e5b243: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link

codecov bot commented Dec 6, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 10 lines in your changes missing coverage. Please review.

Project coverage is 72.10%. Comparing base (42dc22e) to head (7e5b243).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...va/org/opensearch/action/search/SearchRequest.java 75.00% 0 Missing and 4 partials ⚠️
...ensearch/action/search/SearchTransportService.java 72.72% 2 Missing and 1 partial ⚠️
...opensearch/action/search/SearchRequestBuilder.java 0.00% 2 Missing ⚠️
...arch/rest/action/search/RestMultiSearchAction.java 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16681      +/-   ##
============================================
+ Coverage     72.05%   72.10%   +0.05%     
- Complexity    65183    65201      +18     
============================================
  Files          5318     5318              
  Lines        303993   304030      +37     
  Branches      43990    43997       +7     
============================================
+ Hits         219028   219214     +186     
+ Misses        67046    66889     -157     
- Partials      17919    17927       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

❌ Gradle check result for 3a7fd6e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Jan 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request stalled Issues that have stalled
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support timeout based search request cancellation
2 participants