[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang #200476

davismcphee · 2024-11-18T01:58:34Z

Summary

This PR mitigates an issue where the has_es_data check can hang when some remote clusters are unresponsive, leaving users stuck in a loading state in some apps (e.g. Discover and Dashboard) until the request times out. There are two main changes that help mitigate this issue:

The resolve/cluster request in the has_es_data endpoint has been split into two requests -- one for local data first, then another for remote data second. In cases where remote clusters are unresponsive but there is data available in the local cluster, the remote check is never performed and the check completes quickly. This likely resolves the majority of cases and is also likely faster in general than checking both local and remote clusters in a single request.
In cases where there is no local data and the remote resolve/cluster request hangs, a new data_views.hasEsDataTimeout config has been added to kibana.yml (defaults to 5 seconds) to abort the request after a short delay. This scenario is handled in the front end by displaying an error toast to the user informing them of the issue, and assuming there is data available to avoid blocking them. When this occurs, a warning is also logged to the Kibana server logs.
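The two changes above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: `ClusterCheck`, `withTimeout`, `checkLocal`, and `checkRemote` are hypothetical names standing in for the real `resolve/cluster` calls, and the 5000 ms default only mirrors the 5-second default described above.

```typescript
// Hypothetical stand-in for a resolve/cluster call: resolves to true when data exists.
type ClusterCheck = (signal?: AbortSignal) => Promise<boolean>;

// Error name used by this sketch to mark a remote timeout (an assumption).
const REMOTE_DATA_TIMEOUT = 'RemoteDataTimeoutError';

// Runs `check`, aborting it and rejecting if it takes longer than `timeoutMs`.
async function withTimeout(check: ClusterCheck, timeoutMs: number): Promise<boolean> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // signal the in-flight request to stop
      const error = new Error(`Remote check timed out after ${timeoutMs}ms`);
      error.name = REMOTE_DATA_TIMEOUT;
      reject(error);
    }, timeoutMs);
  });

  try {
    return await Promise.race([check(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer when the check wins the race
  }
}

async function hasEsData(
  checkLocal: ClusterCheck,
  checkRemote: ClusterCheck,
  timeoutMs = 5000
): Promise<boolean> {
  // Local first: when local data exists, the remote clusters are never queried.
  if (await checkLocal()) {
    return true;
  }
  // Only the remote check is subject to the timeout.
  return withTimeout(checkRemote, timeoutMs);
}
```

With this shape, an unresponsive remote cluster can only delay the result when the local cluster has no data, and even then only for `timeoutMs`.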

Fixes #200280.

Notes

Modifying the existing version of the has_es_data endpoint in this way should be backward compatible since the behaviour should remain unchanged from before when the client and server versions don't match (please validate if this seems accurate during review).
For a long-term fix, the ES team is investigating the issue with resolve/cluster and will aim to have it behave like resolve/index, which fails quickly when remote clusters are unresponsive. They may also implement other mitigations like a configurable timeout in ES: [Resolve Clusters API] Add option to configure cluster timeout elasticsearch#114020. The purpose of this PR is to provide an immediate solution in Kibana that mitigates the issue as much as possible.
If ES ends up providing another performant method for checking if indices exist instead of resolve/cluster, Kibana should migrate to that. More details in Need performant method of determining whether there are indices elasticsearch#112307.

Testing notes

To reproduce the issue locally, follow these steps:

Follow these instructions (https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756) to set up a local CCS environment.
Stop the remote cluster process.
Use Netcat on the remote cluster port to listen to requests but not respond (e.g. on macOS: nc -l 9600), simulating an unresponsive cluster. See CCS: Should timeout parameter be honored? elasticsearch#32678 for more context.
Navigate to Discover and observe that the has_es_data request hangs. When testing in this PR branch, the request will only wait for 5 seconds before assuming data exists and displaying a toast.

Checklist

[x] Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
[ ] Documentation was added for features that require explanation or tutorials
[x] Unit or functional tests were updated or added to match the most common scenarios
[ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
[x] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The release_note:breaking label should be applied in these situations.
[ ] Flaky Test Runner was used on any tests changed
[x] The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines

elasticmachine · 2024-11-19T05:31:10Z

Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)

lukasolson

Left a couple of minor notes below

lukasolson · 2024-11-19T18:10:44Z

src/plugins/data_views/public/services/has_data.ts

+          e.body?.statusCode === 400 &&
+          e.body?.attributes?.failureReason === HasEsDataFailureReason.remoteDataTimeout
+        ) {
+          core.notifications.toasts.addDanger({


From an API perspective, is it possible that consumers will want to swallow toasts/error messages? Does it make sense to have consumers pass in a onRemoteDataTimeout function that defaults to this behavior, but would also allow consumers to handle it in different ways?

(I think toasts are a decent default behavior but I don't think every possible consumer of this API will want to show a toast in these error scenarios.)


lukasolson · 2024-11-19T18:18:56Z

src/plugins/data_views/public/services/has_data.ts

@@ -82,6 +106,9 @@ export class HasData {

  // ES Data

+  private isResponseError = (e: any): e is IHttpFetchError<ResponseErrorBody> =>


nit: Don't need to use any

Suggested change

-  private isResponseError = (e: any): e is IHttpFetchError<ResponseErrorBody> =>
+  private isResponseError = (e: Error): e is IHttpFetchError<ResponseErrorBody> =>


lukasolson · 2024-11-19T18:24:33Z

src/plugins/data_views/server/rest_api_routes/internal/has_es_data.test.ts

lukasolson · 2024-11-19T19:29:27Z

src/plugins/data_views/server/rest_api_routes/internal/has_es_data.ts

+      return res.badRequest({
+        body: {
+          message: timeoutMessage,
+          attributes: { failureReason: timeoutReason },
+        },
+      });


I'm not sure a "bad request" response makes sense here... The client didn't send anything wrong. Maybe we can do a custom error with a 408 status code?

Suggested change

-      return res.badRequest({
-        body: {
-          message: timeoutMessage,
-          attributes: { failureReason: timeoutReason },
-        },
-      });
+      return res.customError({
+        body: {
+          statusCode: 408,
+          message: timeoutMessage,
+          attributes: { failureReason: timeoutReason },
+        },
+      });


lukasolson · 2024-11-19T19:35:08Z

src/plugins/data_views/server/rest_api_routes/internal/has_es_data.ts

+    return res.badRequest({
+      body: {
+        message: errorMessage,
+        attributes: { failureReason: HasEsDataFailureReason.unknown },
+      },
+    });


This one I'm not sure what to do... We can probably leave as is. Are there any known cases we might fail here? If so, we might want to check e.meta.statusCode and use it.



davismcphee

@lukasolson Thanks for the feedback, and I made some updates.

davismcphee · 2024-11-20T01:30:18Z

src/plugins/data_views/public/services/has_data.ts

+          e.body?.statusCode === 400 &&
+          e.body?.attributes?.failureReason === HasEsDataFailureReason.remoteDataTimeout
+        ) {
+          core.notifications.toasts.addDanger({


Yeah I think that's reasonable. It makes sense to provide a way to override it in cases where consumers have a better way to handle it. Updated here: 47965df.
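One way the override discussed in this thread could look is sketched below. It is an illustration under stated assumptions, not the code from 47965df: `fetchHasEsData` and `showErrorToast` are hypothetical stand-ins for the HTTP call and for `core.notifications.toasts.addDanger`, the string values of `HasEsDataFailureReason` are invented, the 504 status follows the later status change in this PR, and returning `true` for non-timeout errors is also an assumption.

```typescript
// Names mirror the PR; the string values and simplified shapes are assumptions.
const HasEsDataFailureReason = {
  remoteDataTimeout: 'remote_data_timeout',
  unknown: 'unknown',
} as const;

interface ResponseErrorBody {
  statusCode?: number;
  message?: string;
  attributes?: { failureReason?: string };
}

interface HasEsDataOptions {
  // Hypothetical stand-in for the HTTP call to the has_es_data endpoint.
  fetchHasEsData: () => Promise<boolean>;
  // Hypothetical stand-in for core.notifications.toasts.addDanger.
  showErrorToast: (message: string) => void;
  // Optional override: when provided, it replaces the default toast.
  onRemoteDataTimeout?: (body: ResponseErrorBody) => void;
}

async function hasEsData(options: HasEsDataOptions): Promise<boolean> {
  const { fetchHasEsData, showErrorToast, onRemoteDataTimeout } = options;
  try {
    return await fetchHasEsData();
  } catch (e) {
    const body = (e as { body?: ResponseErrorBody } | null | undefined)?.body;
    if (
      body?.statusCode === 504 &&
      body.attributes?.failureReason === HasEsDataFailureReason.remoteDataTimeout
    ) {
      // Default behavior is a toast; consumers can swap in their own handling.
      const handle =
        onRemoteDataTimeout ??
        (() => showErrorToast('Checking remote clusters for data timed out.'));
      handle(body);
    }
    // Assume data exists so the user is not blocked (assumed for all errors here).
    return true;
  }
}
```

A consumer that prefers an inline message over a toast would pass its own `onRemoteDataTimeout` and leave everything else unchanged.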

davismcphee · 2024-11-20T01:30:40Z

src/plugins/data_views/public/services/has_data.ts

@@ -82,6 +106,9 @@ export class HasData {

  // ES Data

+  private isResponseError = (e: any): e is IHttpFetchError<ResponseErrorBody> =>


True! Updated: 079588b.
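To illustrate why the narrower parameter type helps, here is a small sketch. `HttpFetchErrorLike` is a hypothetical, simplified stand-in for Kibana's `IHttpFetchError`, and the guard body (`'body' in e`) is an assumption, since the excerpt above cuts off after `=>`.

```typescript
interface ResponseErrorBody {
  statusCode?: number;
  attributes?: { failureReason?: string };
}

// Simplified, hypothetical stand-in for IHttpFetchError<T>.
interface HttpFetchErrorLike<T> extends Error {
  body?: T;
}

// Taking `Error` instead of `any` makes callers narrow the value before using the guard.
const isResponseError = (e: Error): e is HttpFetchErrorLike<ResponseErrorBody> => 'body' in e;

function failureReasonOf(e: unknown): string | undefined {
  // Caught values are `unknown`, so check for Error first, then apply the guard.
  if (e instanceof Error && isResponseError(e)) {
    return e.body?.attributes?.failureReason;
  }
  return undefined;
}
```

With `e: any`, a call such as `isResponseError('oops')` would compile; with `e: Error`, the compiler rejects it.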

davismcphee · 2024-11-20T01:36:58Z

src/plugins/data_views/server/rest_api_routes/internal/has_es_data.ts

+      return res.badRequest({
+        body: {
+          message: timeoutMessage,
+          attributes: { failureReason: timeoutReason },
+        },
+      });


I agree, 400 isn't really appropriate for this. I originally looked at 408 too, but my understanding is that it's used when a server times out waiting on a request from the client, not when it times out trying to return a response. After reading into it a bit more, I feel like 504 Gateway Timeout might be most appropriate for this case, so I updated it here: f130736.

I was hoping to avoid a generic 500 since it may be misleading and look like a Kibana server failure, but we could instead just go with that if 504 doesn't seem good either.

davismcphee · 2024-11-20T01:40:17Z

src/plugins/data_views/server/rest_api_routes/internal/has_es_data.ts

+    return res.badRequest({
+      body: {
+        message: errorMessage,
+        attributes: { failureReason: HasEsDataFailureReason.unknown },
+      },
+    });


Yeah, also not a good case for 400. And nope, no known cases... Which makes me realize it probably makes sense to just return a 500 here since it's unexpected. I think this is good enough for the client, and we log the underlying error if needed for further investigation. Updated here: f130736.
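The status choices settled on in these two threads (504 for the remote timeout, 500 for anything unexpected) could be summarized by a mapping like the one below. This is only a sketch: `toErrorResponse`, the `ErrorResponse` shape, the `'RemoteDataTimeoutError'` error name, the message strings, and the failure-reason string values are all assumptions for illustration, not the route's actual code.

```typescript
// Assumed string values for the failure reasons named in the PR.
type FailureReason = 'remote_data_timeout' | 'unknown';

// Hypothetical response shape for this sketch.
interface ErrorResponse {
  statusCode: 500 | 504;
  body: { message: string; attributes: { failureReason: FailureReason } };
}

function toErrorResponse(e: unknown): ErrorResponse {
  // Assumes the timeout is signalled by an Error with this name.
  const isRemoteTimeout = e instanceof Error && e.name === 'RemoteDataTimeoutError';

  if (isRemoteTimeout) {
    // The server timed out waiting on an upstream (remote cluster) response.
    return {
      statusCode: 504,
      body: {
        message: 'Timed out waiting for remote clusters.',
        attributes: { failureReason: 'remote_data_timeout' },
      },
    };
  }

  // Anything else is unexpected, so report a generic server error.
  return {
    statusCode: 500,
    body: {
      message: 'Unexpected error while checking for data.',
      attributes: { failureReason: 'unknown' },
    },
  };
}
```

Keeping `failureReason` in the body lets the client distinguish the timeout case without parsing the message text.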

elasticmachine · 2024-11-20T03:12:46Z

💚 Build Succeeded

Buildkite Build
Commit: f130736

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id	before	after	diff
`dataViews`	53	55	+2

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`dataViews`	1.9KB	1.9KB	-1.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id	before	after	diff
`dataViews`	61.6KB	62.6KB	+999.0B

Unknown metric groups

API count

id	before	after	diff
`dataViews`	1224	1225	+1

ESLint disabled line counts

id	before	after	diff
`dataViews`	12	13	+1

Total ESLint disabled count

id	before	after	diff
`dataViews`	14	15	+1

History

💛 Build #252217 was flaky af0148d
💔 Build #252213 failed 8433e1b
💚 Build #251750 succeeded 958cf78
💔 Build #251748 failed 6480523

cc @davismcphee

lukasolson

Latest changes LGTM!

kibanamachine · 2024-11-20T18:53:08Z

Starting backport for target branches: 8.15, 8.16, 8.x

https://github.com/elastic/kibana/actions/runs/11939870117


kibanamachine · 2024-11-20T18:58:27Z

💚 All backports created successfully

Status	Branch	Result
✅	8.15
✅	8.16
✅	8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

… can cause Kibana to hang (#200476) (#201025) # Backport This will backport the following commits from `main` to `8.x`: - [[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)](#200476)  ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport)  Co-authored-by: Davis McPhee <[email protected]>

…k can cause Kibana to hang (#200476) (#201024) # Backport This will backport the following commits from `main` to `8.16`: - [[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)](#200476)  ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport)  Co-authored-by: Davis McPhee <[email protected]>

…k can cause Kibana to hang (#200476) (#201023) # Backport This will backport the following commits from `main` to `8.15`: - [[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)](#200476)  ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport)  --------- Co-authored-by: Davis McPhee <[email protected]>

mistic · 2024-11-21T16:54:05Z

This PR didn't make it into the latest BC of v8.16.1. Updating the labels.


davismcphee added the release_note:fix, Team:DataDiscovery, and backport:prev-major labels Nov 18, 2024

davismcphee self-assigned this Nov 18, 2024

davismcphee and others added 3 commits November 18, 2024 16:04

Mitigate issue where has_es_data check can cause Kibana to hang

a7da40a

[CI] Auto-commit changed files from 'node scripts/notice'

c47688d

Refine timeout logic

6d29a2a

davismcphee force-pushed the fix-has-es-data-hanging branch from 958cf78 to 6d29a2a Compare November 19, 2024 02:23

davismcphee and others added 3 commits November 18, 2024 23:40

Add tests

8433e1b

[CI] Auto-commit changed files from 'node scripts/notice'

0657960

Fix data_views.hasEsDataTimeout config key

af0148d

davismcphee marked this pull request as ready for review November 19, 2024 05:31

davismcphee requested a review from a team as a code owner November 19, 2024 05:31

lukasolson reviewed Nov 19, 2024


davismcphee added 4 commits November 19, 2024 20:12

Merge branch 'main' into fix-has-es-data-hanging

6543761

Allow overriding hasEsData remote data timeout toast with onRemoteDat…

47965df

…aTimeout

Remove any

079588b

Update error statuses

f130736

davismcphee commented Nov 20, 2024


lukasolson approved these changes Nov 20, 2024


davismcphee merged commit 96fd4b6 into elastic:main Nov 20, 2024
25 checks passed

davismcphee deleted the fix-has-es-data-hanging branch November 20, 2024 18:52

kibanamachine added the v9.0.0 label Nov 20, 2024

kibanamachine mentioned this pull request Nov 20, 2024

[8.15] [Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476) #201023

Merged

kibanamachine mentioned this pull request Nov 20, 2024

[8.16] [Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476) #201024

Merged

kibanamachine mentioned this pull request Nov 20, 2024

[8.x] [Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476) #201025

Merged

kibanamachine added the v8.17.0 label Nov 20, 2024

kibanamachine added the v8.16.1 label Nov 20, 2024

kibanamachine added the v8.15.5 label Nov 20, 2024

mistic added v8.16.2 and removed v8.16.1 labels Nov 21, 2024
