CCS: Should `timeout` parameter be honored? #32678

spinscale · 2018-08-07T11:57:17Z

While testing some Cross Cluster Search behaviour I noticed search timeouts are not being honored.

Note: This might be intended behaviour so feel free to close this issue.

Reproduction, started two one node clusters with different clustername

./bin/elasticsearch -Enode.name=rhincodon -Ecluster.name=alr -Enode.max_local_storage_nodes=2
./bin/elasticsearch -Enode.name=remote01 -Ecluster.name=remote -Enode.max_local_storage_nodes=2

configure CCS

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search": {
      "remote": {
        "cluster_remote": {
          "seeds": [
            "127.0.0.1:9301"
          ]
        }
      }
    }
  }
}
'

Now check that CCS works

curl -X PUT localhost:9200/foo
curl -X PUT localhost:9201/bar
curl localhost:9200/cluster_remote:bar,foo/_search

In order to simulate a failure I stopped the remote cluster node, and ran netcat -l 9301, a service that receives data but does not reply anything (i.e. emulating a GC).

Now running a search ends up like this

curl 'localhost:9200/cluster_remote:bar,foo/_search?timeout=5s'
{"error":{"root_cause":[{"type":"connect_transport_exception","reason":"[][127.0.0.1:9301] handshake_timeout[30s]"}],"type":"connect_transport_exception","reason":"[][127.0.0.1:9301] handshake_timeout[30s]"},"status":500}

As you can see the timeout is the 30s handshake timeout and curl also only returns after 30 seconds.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2018-08-07T11:57:18Z

Pinging @elastic/es-search-aggs

javanna · 2018-08-07T13:48:03Z

The search timeout that's provided in the request is applied to the query phase like in an ordinary search request. But with this simulated network disruption (or anytime a remote cluster is down when the cross-cluster search request starts), the request fails earlier, what fails is the initial search_shards call made to the remote cluster, which is a different type of request compared to search and does not support the timeout parameter. Maybe we should apply the search timeout as a transport timeout when sending the search shards request here ?

javanna · 2018-08-10T15:20:48Z

We discussed this during fix-it-friday and we decided that we should try and set the timeout as a transport timeout to the search shards request as it is unexpected that the request times out after 30 seconds when the search timeout was set to 10 seconds. Each search_shards request (one per cluster) will have the same timeout as search, which will happen later, so this is best-effort and not really a guarantee that the request will never take longer than the timeout overall. We discussed also tweaking the search timeout to what is left of the initial search timeout once the search_shards phase is completed (given that the coordinating node waits for all the clusters to be back before moving on to the search phase), and we said it may be nice but not strictly needed, we may consider doing this later.

javanna · 2023-03-07T14:04:49Z

We haven't come around to make the recommended change above. Search and async search support now automatic cancellation on connection close. My thinking is that we should propagate the cancellation to the search shards call that async search (as well as search when not minimizing roundtrips) makes. I believe this would be enough and we would not need to propagate the search timeout to the search shards call. @DaveCTurner do you have thoughts?

DaveCTurner · 2023-03-07T14:30:38Z

👍 to having one timeout on the client side and doing everything else via cancellation.

…a to hang (#200476) ## Summary This PR mitigates an issue where the `has_es_data` check can hang when some remote clusters are unresponsive, leaving users stuck in a loading state in some apps (e.g. Discover and Dashboard) until the request times out. There are two main changes that help mitigate this issue: - The `resolve/cluster` request in the `has_es_data` endpoint has been split into two requests -- one for local data first, then another for remote data second. In cases where remote clusters are unresponsive but there is data available in the local cluster, the remote check is never performed and the check completes quickly. This likely resolves the majority of cases and is also likely faster in general than checking both local and remote clusters in a single request. - In cases where there is no local data and the remote `resolve/cluster` request hangs, a new `data_views.hasEsDataTimeout` config has been added to `kibana.yml` (defaults to 5 seconds) to abort the request after a short delay. This scenario is handled in the front end by displaying an error toast to the user informing them of the issue, and assuming there is data available to avoid blocking them. When this occurs, a warning is also logged to the Kibana server logs. ![CleanShot 2024-11-18 at 23 47 34@2x](https://github.com/user-attachments/assets/6ea14869-b6b6-4d89-a90c-8150d6e6b043) Fixes #200280. ### Notes - Modifying the existing version of the `has_es_data` endpoint in this way should be backward compatible since the behaviour should remain unchanged from before when the client and server versions don't match (please validate if this seems accurate during review). - For a long term fix, the ES team is investigating the issue with `resolve/cluster` and will aim to have it behave like `resolve/index`, which fails quickly when remote clusters are unresponsive. They may also implement other mitigations like a configurable timeout in ES: elastic/elasticsearch#114020. The purpose of this PR is to provide an immediate solution in Kibana that mitigates the issue as much as possible. - If ES ends up providing another performant method for checking if indices exist instead of `resolve/cluster`, Kibana should migrate to that. More details in elastic/elasticsearch#112307. ### Testing notes To reproduce the issue locally, follow these steps: - Follow [these instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756) to set up a local CCS environment. - Stop the remote cluster process. - Use Netcat on the remote cluster port to listen to requests but not respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive cluster. See elastic/elasticsearch#32678 for more context. - Navigate to Discover and observe that the `has_es_data` request hangs. When testing in this PR branch, the request will only wait for 5 seconds before assuming data exists and displaying a toast. ### Checklist - [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [x] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [x] The PR description includes the appropriate Release Notes section, and the correct `release_node:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) --------- Co-authored-by: kibanamachine <[email protected]>

…a to hang (elastic#200476) ## Summary This PR mitigates an issue where the `has_es_data` check can hang when some remote clusters are unresponsive, leaving users stuck in a loading state in some apps (e.g. Discover and Dashboard) until the request times out. There are two main changes that help mitigate this issue: - The `resolve/cluster` request in the `has_es_data` endpoint has been split into two requests -- one for local data first, then another for remote data second. In cases where remote clusters are unresponsive but there is data available in the local cluster, the remote check is never performed and the check completes quickly. This likely resolves the majority of cases and is also likely faster in general than checking both local and remote clusters in a single request. - In cases where there is no local data and the remote `resolve/cluster` request hangs, a new `data_views.hasEsDataTimeout` config has been added to `kibana.yml` (defaults to 5 seconds) to abort the request after a short delay. This scenario is handled in the front end by displaying an error toast to the user informing them of the issue, and assuming there is data available to avoid blocking them. When this occurs, a warning is also logged to the Kibana server logs. ![CleanShot 2024-11-18 at 23 47 34@2x](https://github.com/user-attachments/assets/6ea14869-b6b6-4d89-a90c-8150d6e6b043) Fixes elastic#200280. ### Notes - Modifying the existing version of the `has_es_data` endpoint in this way should be backward compatible since the behaviour should remain unchanged from before when the client and server versions don't match (please validate if this seems accurate during review). - For a long term fix, the ES team is investigating the issue with `resolve/cluster` and will aim to have it behave like `resolve/index`, which fails quickly when remote clusters are unresponsive. They may also implement other mitigations like a configurable timeout in ES: elastic/elasticsearch#114020. The purpose of this PR is to provide an immediate solution in Kibana that mitigates the issue as much as possible. - If ES ends up providing another performant method for checking if indices exist instead of `resolve/cluster`, Kibana should migrate to that. More details in elastic/elasticsearch#112307. ### Testing notes To reproduce the issue locally, follow these steps: - Follow [these instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756) to set up a local CCS environment. - Stop the remote cluster process. - Use Netcat on the remote cluster port to listen to requests but not respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive cluster. See elastic/elasticsearch#32678 for more context. - Navigate to Discover and observe that the `has_es_data` request hangs. When testing in this PR branch, the request will only wait for 5 seconds before assuming data exists and displaying a toast. ### Checklist - [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [x] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [x] The PR description includes the appropriate Release Notes section, and the correct `release_node:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) --------- Co-authored-by: kibanamachine <[email protected]> (cherry picked from commit 96fd4b6)

… can cause Kibana to hang (#200476) (#201025) # Backport This will backport the following commits from `main` to `8.x`: - [[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)](#200476)  ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport)  Co-authored-by: Davis McPhee <[email protected]>

…k can cause Kibana to hang (#200476) (#201024) # Backport This will backport the following commits from `main` to `8.16`: - [[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)](#200476)  ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport)  Co-authored-by: Davis McPhee <[email protected]>

…k can cause Kibana to hang (#200476) (#201023) # Backport This will backport the following commits from `main` to `8.15`: - [[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)](#200476)  ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport)  --------- Co-authored-by: Davis McPhee <[email protected]>

…a to hang (elastic#200476) ## Summary This PR mitigates an issue where the `has_es_data` check can hang when some remote clusters are unresponsive, leaving users stuck in a loading state in some apps (e.g. Discover and Dashboard) until the request times out. There are two main changes that help mitigate this issue: - The `resolve/cluster` request in the `has_es_data` endpoint has been split into two requests -- one for local data first, then another for remote data second. In cases where remote clusters are unresponsive but there is data available in the local cluster, the remote check is never performed and the check completes quickly. This likely resolves the majority of cases and is also likely faster in general than checking both local and remote clusters in a single request. - In cases where there is no local data and the remote `resolve/cluster` request hangs, a new `data_views.hasEsDataTimeout` config has been added to `kibana.yml` (defaults to 5 seconds) to abort the request after a short delay. This scenario is handled in the front end by displaying an error toast to the user informing them of the issue, and assuming there is data available to avoid blocking them. When this occurs, a warning is also logged to the Kibana server logs. ![CleanShot 2024-11-18 at 23 47 34@2x](https://github.com/user-attachments/assets/6ea14869-b6b6-4d89-a90c-8150d6e6b043) Fixes elastic#200280. ### Notes - Modifying the existing version of the `has_es_data` endpoint in this way should be backward compatible since the behaviour should remain unchanged from before when the client and server versions don't match (please validate if this seems accurate during review). - For a long term fix, the ES team is investigating the issue with `resolve/cluster` and will aim to have it behave like `resolve/index`, which fails quickly when remote clusters are unresponsive. They may also implement other mitigations like a configurable timeout in ES: elastic/elasticsearch#114020. The purpose of this PR is to provide an immediate solution in Kibana that mitigates the issue as much as possible. - If ES ends up providing another performant method for checking if indices exist instead of `resolve/cluster`, Kibana should migrate to that. More details in elastic/elasticsearch#112307. ### Testing notes To reproduce the issue locally, follow these steps: - Follow [these instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756) to set up a local CCS environment. - Stop the remote cluster process. - Use Netcat on the remote cluster port to listen to requests but not respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive cluster. See elastic/elasticsearch#32678 for more context. - Navigate to Discover and observe that the `has_es_data` request hangs. When testing in this PR branch, the request will only wait for 5 seconds before assuming data exists and displaying a toast. ### Checklist - [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [x] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [x] The PR description includes the appropriate Release Notes section, and the correct `release_node:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) --------- Co-authored-by: kibanamachine <[email protected]>

spinscale added the :Search/Search Search-related issues that do not fall into other categories label Aug 7, 2018

javanna added discuss >enhancement and removed discuss labels Aug 10, 2018

javanna self-assigned this Aug 10, 2018

javanna mentioned this issue Oct 11, 2018

Improve CCS network faults detection #34405

Closed

5 tasks

rjernst added the Team:Search Meta label for search team label May 4, 2020

matschaffer mentioned this issue Jun 22, 2021

[Monitoring] Handle failed/unavailable CCS requests elastic/kibana#100696

Closed

javanna removed their assignment Feb 21, 2022

javanna added the help wanted adoptme label Feb 21, 2022

javanna closed this as not planned Won't fix, can't repro, duplicate, stale Jun 24, 2024

davismcphee mentioned this issue Nov 19, 2024

[Data Views] Mitigate issue where has_es_data check can cause Kibana to hang elastic/kibana#200476

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CCS: Should `timeout` parameter be honored? #32678

CCS: Should `timeout` parameter be honored? #32678

spinscale commented Aug 7, 2018

elasticmachine commented Aug 7, 2018

javanna commented Aug 7, 2018

javanna commented Aug 10, 2018

javanna commented Mar 7, 2023

DaveCTurner commented Mar 7, 2023

CCS: Should timeout parameter be honored? #32678

CCS: Should timeout parameter be honored? #32678

Comments

spinscale commented Aug 7, 2018

elasticmachine commented Aug 7, 2018

javanna commented Aug 7, 2018

javanna commented Aug 10, 2018

javanna commented Mar 7, 2023

DaveCTurner commented Mar 7, 2023

CCS: Should `timeout` parameter be honored? #32678

CCS: Should `timeout` parameter be honored? #32678