
Cross Cluster search causes UI to hang while getting cross cluster field names/types #167706

Closed
desean1625 opened this issue Sep 29, 2023 · 28 comments · Fixed by #177240
Assignees
Labels
bug Fixes for quality problems that affect the customer experience :DataDiscovery/fix-it-week impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. loe:small Small Level of Effort Team:DataDiscovery Discover, search (e.g. data plugin and KQL), data views, saved searches. For ES|QL, use Team:ES|QL.

Comments

@desean1625
Contributor

desean1625 commented Sep 29, 2023

Kibana version:
8.6.2

Describe the bug:
The data views route /api/index_patterns/_fields_for_wildcard causes the UI not to populate for a long time if a cross cluster search includes clusters that are slow to respond.
Can a cached version be served while an async process keeps the cache up to date?

@ndmitch311

@desean1625 desean1625 added the bug Fixes for quality problems that affect the customer experience label Sep 29, 2023
@botelastic botelastic bot added the needs-team Issues missing a team label label Sep 29, 2023
@jughosta jughosta added the Team:DataDiscovery Discover, search (e.g. data plugin and KQL), data views, saved searches. For ES|QL, use Team:ES|QL. label Oct 4, 2023
@elasticmachine
Contributor

Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Oct 4, 2023
@kertal
Member

kertal commented Oct 5, 2023

thx for reporting, I'm interested: are there frozen data tiers included in those CCS searches? Where do you experience the UI hang? In Discover when loading or switching data views? Or when trying to add filters? In KQL? Thx!

@desean1625
Contributor Author

desean1625 commented Oct 6, 2023

Anywhere Kibana is using the data views. So everywhere.
Specifically, what is happening is that the DataViewsService.get method creates a data view, needs the fields to build it, and calls refreshFieldSpecMap.

If the index pattern is *:index-* it needs to get the fields for a wildcard pattern (here in the data_views_api_client) and pushes this to the server to do the request. The Kibana server side then does a cross cluster search and the UI hangs until the results are returned.

This call utilizes the Elasticsearch client, specifically the field caps API, without any options, so the request will take the default requestTimeout of 30000 ms.
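
For illustration, the Elasticsearch JS client accepts per-request transport options, so in principle a shorter timeout could be passed along; a minimal sketch (the node URL and the 5000 ms value are placeholders, not what Kibana does today):

import { Client } from '@elastic/elasticsearch';

async function fetchFieldCapsWithTimeout() {
  // Placeholder connection details; in Kibana the client is provided by core.
  const esClient = new Client({ node: 'https://mycluster:9200' });

  // Same field_caps call, but with an explicit per-request timeout instead of
  // relying on the client's default of 30 000 ms.
  return esClient.fieldCaps(
    { index: '*:index-*', fields: '*' },
    { requestTimeout: 5000 } // transport option accepted by @elastic/elasticsearch
  );
}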

After that it caches the data view client side. So after the first load it is fast, until you open a new tab or refresh.

If one of the remote clusters doesn't have connectivity or is on a slow connection, this call takes 30 seconds every time an index is selected, for every user (hundreds), every time they refresh or open a new tab.

What I was hoping is that the results of the field_caps call could be cached server side in Kibana. This would speed up the call for every user that opens Kibana. The server side caching should be easy to implement here.

@mattkime
Contributor

mattkime commented Oct 6, 2023

We can't serve a cached version because that would potentially circumvent field level security - https://www.elastic.co/guide/en/elasticsearch/reference/current/field-level-security.html

While it's not unusual for cross cluster requests to take longer, this sounds extreme. Is the frozen data tier involved?

We currently have a couple of efforts focused on improving worst case field loading scenarios, but it's hard to tell if this would help your case. Is your problematic cluster just slow, or is it inconsistent? It's tricky to work around an unreliable data source.

@desean1625
Contributor Author

The _fields_for_wildcard call consistently takes between 20 and 38 s and returns only 12.6 kB.

Our ILM only has hot and warm tiers.
We have about 8 clusters, geographically separated, each with 20-80 indices linked to the ILM. The number of fields per index is also high: the rollup of all the fields returns 1178 fields across 382 total indices.

@desean1625
Contributor Author

We can't serve a cached version because that would potentially circumvent field level security - https://www.elastic.co/guide/en/elasticsearch/reference/current/field-level-security.html

I don't think the field metadata circumvents field level security, because no data is being pulled in this call. Unless field level security even prevents users from knowing that a field exists.

@desean1625
Contributor Author

desean1625 commented Oct 6, 2023

This actually appears to be compounded by a deeper issue with the HTTP router handling requests sequentially.

I booted all my users and scaled my Kibana instances down to 1. The request for _fields_for_wildcard took exactly 2.3 seconds. I hit that endpoint a bunch of times and noticed that the requests were handled sequentially.

[screenshot: browser dev tools Network tab showing the _fields_for_wildcard requests handled sequentially]

This means that the more users we have, the worse the problem is (which is why the UI consistently hangs for 20-30 seconds, if not longer).

Even waiting the 2.3 seconds to gather the fields from the remote clusters is too long from a UI/user perspective.
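
For reference, a rough sketch of this kind of manual test from the browser console (the pattern and query string are placeholders, not the exact request Kibana sends):

// Fire 10 identical requests at once and log how long each one takes.
const url = '/api/index_patterns/_fields_for_wildcard?pattern=*:index-*';

await Promise.all(
  Array.from({ length: 10 }, async (_, i) => {
    const start = performance.now();
    await fetch(url, { credentials: 'include' });
    console.log(`request ${i}: ${Math.round(performance.now() - start)} ms`);
  })
);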

@kertal
Member

kertal commented Oct 9, 2023

@desean1625 thx for sharing, we are currently aiming to reduce and optimize the requests for fields. When you have multiple users, the requests for fields should not be handled sequentially per user. The screenshot you shared is from your browser's DevTools, right? In which part of Kibana did you see that pattern of so many requests for the same fields? thx

@desean1625
Contributor Author

desean1625 commented Oct 9, 2023

@kertal The screenshots were from the dev tools. It was a manual test to simulate multiple requests.

I created an example repo that does a "stress test" (only 10 requests) to show the router handles the requests sequentially.
https://github.com/desean1625/kibana_router_test

git clone it into kibana/plugins and build the plugin.

@desean1625
Contributor Author

Disregard the issue with sequential requests. It was being caused by the browser: Edge makes the requests sequential if the URL parameters haven't changed.
Edge: [screenshot showing the requests handled sequentially]
Chrome: [screenshot showing the requests handled in parallel]

@mattkime
Contributor

mattkime commented Oct 9, 2023

Unless the field-level-security even prevents the users from knowing if the field even exists.

Unfortunately that's exactly what it does.

@desean1625 What are you doing in kibana that kicks off so many requests?

The number of fields you're using sounds very reasonable and shouldn't be causing performance issues.

Even waiting the 2.3 seconds to gather the fields from the remote clusters is too long from a ui/user perspective.

Absolutely. I would expect much faster times based on your description.

If you're willing, providing a HAR file that captures the slow loading might be helpful.

@desean1625
Contributor Author

You can kick off the requests by clicking on the index pattern in Discover. The popover doesn't close until the request is completed, so you can click multiple times and reinitiate the request. Users do this because it takes 9-35 seconds for the response.

@davismcphee
Contributor

Disregard the issue with sequential requests. It was being caused by the browser: Edge makes the requests sequential if the URL parameters haven't changed.

Yeah, I believe browsers do this in case the first request returns cache headers, in which case subsequent requests should be served from the cache (the odd case where caching is actually slower). But it's an interesting point to raise regardless, because if we have instances in Kibana where X number of the same field caps requests are fired at once (we do, unfortunately), then this makes the problem X times worse. It's not the root cause or solution to this performance issue, but since we know the _fields_for_wildcard endpoint doesn't use caching, we could avoid making the issue worse by including something like a cache busting query param or Cache-Control header when requesting fields, which should cause the requests to run simultaneously at least.
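
A rough sketch of that idea from the client side (the helper name and extra query param are illustrative, not the actual data views client code, and they assume the endpoint tolerates an unknown query param):

// Make otherwise-identical field requests distinguishable so the browser
// doesn't queue them while waiting to see if the first response is cacheable.
async function fetchFieldsForWildcard(pattern: string): Promise<Response> {
  const params = new URLSearchParams({
    pattern,
    _cb: Date.now().toString(), // hypothetical cache-busting value
  });
  return fetch(`/api/index_patterns/_fields_for_wildcard?${params}`, {
    headers: { 'cache-control': 'no-cache' },
  });
}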

@mattkime
Contributor

mattkime commented Oct 9, 2023

@desean1625 So in order to kick off so many requests, you're selecting different data views before the previous one has finished loading?

If that's the case, then we should focus on how we can speed up a particular fields_for_wildcard request

The popover doesn't close out until the request is completed.

Which popover? Can you provide a screenshot?

@davismcphee
Contributor

@desean1625 Slow field lists are definitely something that needs to be addressed and that we're actively looking to improve, but as an aside I wonder if some of our planned CCS improvements would also be helpful for this use case: #164350. Not all of the plans have been shared publicly yet, but in general we're looking to give users greater control over their clusters, such as notifying them of problematic/slow clusters and providing quick options in the UI to exclude them. Just curious if it seems like these types of improvements could help the issue from a slightly different angle?

@desean1625
Contributor Author

@desean1625 Slow field lists are definitely something that needs to be addressed and that we're actively looking to improve, but as an aside I wonder if some of our planned CCS improvements would also be helpful for this use case: #164350. Not all of the plans have been shared publicly yet, but in general we're looking to give users greater control over their clusters, such as notifying them of problematic/slow clusters and providing quick options in the UI to exclude them. Just curious if it seems like these types of improvements could help the issue from a slightly different angle?

Yes, I believe these planned improvements will help, because they fully integrate some of what our own plugins already do. Specifically, the capability from our "Advanced cross cluster search" plugin that allows users to globally turn off specific clusters is being implemented in ticket #99100.

Our implementation didn't cover all cases because it was just a hook into the searchInterceptor, and not everything is routed through @kbn/data-plugin, namely TSVB and other routes that do server side requests like /api/index_patterns/_fields_for_wildcard.

@desean1625
Contributor Author

Which popover? Can you provide a screenshot?

[screenshot of the data view picker popover in Discover]

@desean1625
Contributor Author

@mattkime if you want to simulate what our experience is like, change this line to the following:

  // Artificial delay of 2-31 seconds before the fields request, to simulate slow remote clusters
  await new Promise((r) => setTimeout(r, (Math.floor(Math.random() * 30) + 2) * 1000));
  const { fields, indices } = await indexPatterns.getFieldsForWildcard({

@mattkime
Contributor

@desean1625 I think the quickest way to improve your setup is to learn why the field_caps requests are slow. It would be helpful if you could verify, via the Kibana dev console, that direct requests to ES take about the same amount of time as the fields_for_wildcard responses:

GET /{index-pattern}/_field_caps?fields=<fields>

@desean1625
Contributor Author

desean1625 commented Oct 11, 2023

@mattkime
curl -k -o /dev/null -s https://elastic:changeme@mycluster:9200/*:myindex-*/_field_caps?fields=* -w "%{time_total}"
1.434182

GET /*:myindex-*/_field_caps?fields=*
4916ms-16305ms

Just note that from my local box the Elasticsearch endpoint isn't exposed, so the curl was run from the same server that hosts Kibana, which is one less hop. But for 19 kB I wouldn't expect a significant difference.

@kertal
Member

kertal commented Oct 13, 2023

Which popover? Can you provide a screenshot?

[screenshot of the data view picker popover in Discover]

@desean1625 we're working on that: #167221

@kertal
Member

kertal commented Oct 17, 2023

@mattkime curl -k -o /dev/null -s https://elastic:changeme@mycluster:9200/*:myindex-*/_field_caps?fields=* -w "%{time_total}" 1.434182

GET /*:myindex-*/_field_caps?fields=* 4916ms-16305ms

Just note that from my local box the Elasticsearch endpoint isn't exposed, so the curl was run from the same server that hosts Kibana, which is one less hop. But for 19 kB I wouldn't expect a significant difference.

So when you're running curl directly on the Kibana server it takes just 1.5 s, vs 4-16 s when you run it via Console in Kibana in the browser? There shouldn't be that much difference in this case. It's clear that curl is faster here because, as you said, it's one less hop. Could you look at the timing of the request in the browser's dev tools? It's the proxy request in the Network tab. It would be interesting to see how fast the server responds and how long the Content Download takes, to get more insight into the communication between Kibana and your browser.

[screenshot: request timing shown in the browser dev tools for a Console request]

@desean1625
Contributor Author

GET /*:myindex-*/_field_caps?fields=*
the timing is as follows

request sent 1.01 ms
Waiting for server response 36.01s
Content Download 58.39ms

second run

request sent 2.64 ms
Waiting for server response 16.21s
Content Download 16.22ms

third run

request sent 1.41 ms
Waiting for server response 2.32s
Content Download 38.98ms

fourth run

request sent 0.96 ms
Waiting for server response 6.68s
Content Download 20.63ms

@kertal
Member

kertal commented Oct 19, 2023

Thx for sharing. So there seems to be a wide range of response times, but unfortunately this isn't something we can fix: the request is sent to your CCS clusters, and it takes that long until the fields of all CCS instances are returned. Let's aim to fix what you reported initially. I've created an issue for that, #169360, and #167221 should make switching data views fast again.

@desean1625
Contributor Author

The initial request is to cache the fields response, so users don't have to wait up to 30 seconds when adding a map layer, switching data views, or trying to build a Lens visualization. This is an issue that plagues all of Kibana and makes it difficult to use. Users cannot do anything but wait.

@desean1625
Contributor Author

Here is the basic concept for caching: it accounts for the current user and their associated roles, while always trying to keep the cache current. Maybe you could cache it as a stored object instead of keeping it in memory?
https://gist.github.com/desean1625/cb6e019a6a3e137468918eef0ad5211d
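
To make the concept concrete, a minimal sketch of the idea as described (not the gist's actual code): entries keyed on the user, their roles, and the pattern, with stale entries refreshed in the background while the cached copy is served.

// Cache field_caps responses per user + roles, refreshing stale entries asynchronously.
interface CacheEntry {
  fields: unknown;
  fetchedAt: number;
}

const cache = new Map<string, CacheEntry>();
const MAX_AGE_MS = 60_000; // illustrative staleness threshold

function cacheKey(username: string, roles: string[], pattern: string): string {
  return `${username}|${[...roles].sort().join(',')}|${pattern}`;
}

async function getFieldsCached(
  username: string,
  roles: string[],
  pattern: string,
  fetchFields: () => Promise<unknown>
): Promise<unknown> {
  const key = cacheKey(username, roles, pattern);
  const entry = cache.get(key);
  if (entry) {
    // Serve the cached copy immediately; refresh it in the background if stale.
    if (Date.now() - entry.fetchedAt > MAX_AGE_MS) {
      fetchFields()
        .then((fields) => cache.set(key, { fields, fetchedAt: Date.now() }))
        .catch(() => {
          // keep the stale entry if the refresh fails
        });
    }
    return entry.fields;
  }
  const fields = await fetchFields();
  cache.set(key, { fields, fetchedAt: Date.now() });
  return fields;
}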

@mattkime
Contributor

Maybe you could cache it as a stored object instead of keeping it in memory?

We can't do that because of security concerns - different users may get different field lists.

I'm exploring caching field requests based on HTTP headers - #168910
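
As a generic illustration of header-based caching (plain Node, not the actual Kibana route code): a Cache-Control: private, max-age response lets each user's own browser reuse their own field list for a short time, with no shared cache that could leak fields across users. loadFieldsForCurrentUser below is a hypothetical stand-in for the real per-user field_caps call.

import { createServer, IncomingMessage } from 'node:http';

// Hypothetical stand-in for the real per-user field_caps lookup.
async function loadFieldsForCurrentUser(_req: IncomingMessage): Promise<unknown> {
  return { fields: [] };
}

const server = createServer(async (req, res) => {
  if (req.url?.startsWith('/fields')) {
    const body = JSON.stringify(await loadFieldsForCurrentUser(req));
    res.writeHead(200, {
      'content-type': 'application/json',
      // "private": only the end user's browser may cache; "max-age": reuse for 5 minutes.
      'cache-control': 'private, max-age=300',
    });
    res.end(body);
    return;
  }
  res.writeHead(404);
  res.end();
});

server.listen(3000);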

@davismcphee davismcphee added loe:large Large Level of Effort impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. and removed feedback_needed labels Jan 4, 2024
@kertal
Member

kertal commented Feb 15, 2024

I took another look at this, and while I think we already mostly addressed what was discussed here, one thing is missing.
When a very slow data view is initially fetching its fields, we don't display any loading indication, so the interface doesn't change and the user will wonder what is happening.

Luckily this shouldn't be too complicated to address:

  services,
  internalState,
  appState,
}: {
  services: DiscoverServices;
  internalState: DiscoverInternalStateContainer;
  appState: DiscoverAppStateContainer;
}
) {
  addLog('[ui] changeDataView', { id });
  const { dataViews, uiSettings } = services;
  const dataView = internalState.getState().dataView;
  const state = appState.getState();
  let nextDataView: DataView | null = null;
  try {
    nextDataView = typeof id === 'string' ? await dataViews.get(id, false) : id;
  } catch (e) {
    //
  }

Before the new data view is requested, we should show Discover's loading state.
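
One possible shape for this, sketched against the snippet above; setDataViewLoading is an assumed state transition here, not the actual Discover API:

  // Show a loading indicator before the potentially slow fields fetch,
  // and clear it once the data view has been resolved (or failed to resolve).
  internalState.transitions.setDataViewLoading(true); // hypothetical transition
  let nextDataView: DataView | null = null;
  try {
    nextDataView = typeof id === 'string' ? await dataViews.get(id, false) : id;
  } catch (e) {
    // swallow, as the existing code does
  } finally {
    internalState.transitions.setDataViewLoading(false);
  }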

@kertal kertal added loe:small Small Level of Effort :DataDiscovery/fix-it-week and removed loe:large Large Level of Effort labels Feb 15, 2024
@kertal kertal self-assigned this Feb 20, 2024