Data views usage collection causes large amount of Elasticsearch queries #151064

rudolf · 2023-02-13T19:25:35Z

Data views usage collection causes several Elasticsearch queries to be run per data view. When clusters have a large amount of data views this can cause a spike in connections to Elasticsearch, large increases in memory consumption and event loop delays ultimately leading to degraded performance. In one example there were 4689 Elasticsearch queries generated by 1618 data views.

These requests include capturing new telemetry from the saved objects resolve API #112025 and core-usage-stats #85706 both introduced by sharing to multiple spaces #113743

The impact of this is further exacberated by the fact that the maps plugin will also iterate over all data views for it's usage collection.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2023-02-13T19:25:38Z

Pinging @elastic/kibana-core (Team:Core)

elasticmachine · 2023-02-13T19:25:38Z

Pinging @elastic/kibana-presentation (Team:Presentation)

elasticmachine · 2023-02-13T19:25:38Z

Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)

nreese · 2023-02-13T19:41:52Z

elastic/elasticsearch#83636 updated Elasticsearch fieldcaps endpoint to allow filtering by field type. Maps plugin could explore using a single elasticsearch fieldcaps request to find geo_point and geo_shape indices instead of iterating over each data view.

For example, maps telemetry could just call GET /_field_caps?fields=*&types=geo_point,geo_shape.

Questions:

will this usage of field caps effect cluster performance
is there paging of field caps response to handle large responses
Maybe request could also include indices parameter to only look through X number of indices at a time?

…#151072) While investigating #151064, I found a problem with IndexPatternStatsCollector where geo_shape aggregation usage with vector tile layers are not counted. Steps to view problem: * Download [world countries geojson](https://vector.maps.elastic.co/files/world_countries_v7.geo.json?elastic_tile_service_tos=agree&my_app_name=ems-landing-page&my_app_version=8.6.0&license=643c1faf-80fc-4ab0-9323-4d9bd11f4bbc) * use file upload to upload world countries into your Elastic stack * add a new cluster layer to your map. * Select world countries index * Select **Hexagons** * Click **Add layer** * Save map * Open borwser dev tools and switch to network tab * Open Kibana dev tools and run ``` POST kbn:api/telemetry/v2/clusters/_stats { "unencrypted": true } ``` * Copy response for `_stats` request. Search for `geoShapeAggLayersCount`. Notice how the value is zero but it should be one since you have one map using geo shape aggregation <img width="600" alt="Screen Shot 2023-02-13 at 1 14 34 PM" src="https://user-images.githubusercontent.com/373691/218565153-0060dd4b-e422-477f-8b07-9f4dabd73064.png"> PR resolves the problem by removing layer type guard. The guard is error prone and easy to not update with new layer types. The guard does not provide any value, since the logic is really concerned with source types and the source type guards provide the correct protections. Steps to test: Follow steps above and verify `geoShapeAggLayersCount` is one

…#151072) While investigating #151064, I found a problem with IndexPatternStatsCollector where geo_shape aggregation usage with vector tile layers are not counted. Steps to view problem: * Download [world countries geojson](https://vector.maps.elastic.co/files/world_countries_v7.geo.json?elastic_tile_service_tos=agree&my_app_name=ems-landing-page&my_app_version=8.6.0&license=643c1faf-80fc-4ab0-9323-4d9bd11f4bbc) * use file upload to upload world countries into your Elastic stack * add a new cluster layer to your map. * Select world countries index * Select **Hexagons** * Click **Add layer** * Save map * Open borwser dev tools and switch to network tab * Open Kibana dev tools and run ``` POST kbn:api/telemetry/v2/clusters/_stats { "unencrypted": true } ``` * Copy response for `_stats` request. Search for `geoShapeAggLayersCount`. Notice how the value is zero but it should be one since you have one map using geo shape aggregation <img width="600" alt="Screen Shot 2023-02-13 at 1 14 34 PM" src="https://user-images.githubusercontent.com/373691/218565153-0060dd4b-e422-477f-8b07-9f4dabd73064.png"> PR resolves the problem by removing layer type guard. The guard is error prone and easy to not update with new layer types. The guard does not provide any value, since the logic is really concerned with source types and the source type guards provide the correct protections. Steps to test: Follow steps above and verify `geoShapeAggLayersCount` is one (cherry picked from commit 9d7e109)

…emetry (#151072) (#151089) # Backport This will backport the following commits from `main` to `8.7`: - [[maps] include vector tile layers in geoShapeAggLayersCount telemetry (#151072)](#151072)  ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport)  Co-authored-by: Nathan Reese <[email protected]>

mattkime · 2023-02-14T14:50:36Z

@rudolf Do we know what percentage of the performance problem is attributable to the resolve calls?

Past that, it seems that the data views api should provide methods that load only the fields needed for telemetry, similar to what @nreese suggests.

rudolf · 2023-02-16T19:11:07Z

In #151110 @afharo was able to reproduce the problem even with empty fields. Which validates my theory, the problem we observed is because of a spike in network connections not the payload size of these connections. Having a large payload would obviously make this even worse but as a start I think we should try to optimise the amount of outgoing requests.

I "manually" classified the requests using a filter aggregation. These are the requests caused by data views telemetry for a single telemetry usage collection run:

Maps telemetry causes an identical distribution of data views related requests but this time nearly double the amount:

Explanation of the "legend":

"config" is a GET to the UiSettings/Advanced settings config saved object
legacy-url-alias is telemetry to collect that resolve was used Add SavedObjectsClient.bulkResolve #112025
core-usage-stats is telemetry collected by all saved objects repository methods Update core usage stats collection #85706
_field_caps requests

I can't really make sense of these numbers it seems like we do 1 of each, but why 311?

By doing a breakdown by trace.id I compared the requests generated for different telemetry collection requests:

Perhaps at different times different amounts of the data views are already cached? Or perhaps the server gets overloaded and there's a timeout before all the requests are able to complete? I can't really explain the differences here.

But fundamentally there's no batching. Each index pattern is loaded individually. If we loaded batches of 100 we'd probably have 100x less requests of (1), (2), (3). Not sure if we want to batch _field_caps requests as that could cause a large payload response.

nreese · 2023-02-16T19:23:29Z

Not sure if we want to batch _field_caps requests as that could cause a large payload response.

Maps telemetry is calculating the total number of indices with geospatial fields. Batching _field_caps requests with field type filtering for geo_point and geo_shape would work really well for this use case.

rudolf · 2023-02-16T19:34:47Z

Something I don't yet understand is https://github.com/elastic/kibana/pull/85706/files added telemetry to the HTTP saved objects API, so internal calls to the server-side saved objects client shouldn't cause this core-usage-stats telemetry writes.

mattkime · 2023-02-17T05:51:31Z

If I'm understanding this correctly -

config - It would be nice to find a way to load config less frequently. I'm curious how this works on the server. This would remove approx 1/9th of the requests.
legacy-url-alias - side effect of using resolve. using get would remove approximately 2/9ths of the requests.
core-usage-stats - not a lot to be done about this one.
_field_caps - I think I can cut these in half since the data views telemetry doesn't need the regular field list. That would get rid of another 1/9th. I haven't found a method of batching field_caps.

Finally, using bulk get or bulk resolve would mostly eliminate 2 and 3 requests, or nearly half of the requests. Might reduce config requests as well.

Overall it seems that we can cut the number of requests in half. Is this good enough? I don't think it is. Looking at the SDH, I can see extended periods where kibana was impacted - perhaps a couple of hours.

Is there other telemetry collection code that fires off a lot of requests? Do we need to take a wider view?

setTimeout could be used to delay creating new requests but I'm not sure thats the right choice if telemetry is already taking a couple of hours to run.

What is it about making requests thats CPU or memory intensive? Am I making too much of the spikes in the resource utilization in the SDH?

rudolf · 2023-02-17T22:17:14Z

so internal calls to the server-side saved objects client shouldn't cause this core-usage-stats telemetry writes.

I had a look at the payload and this is not actually core-usage-data from the HTTP APIs https://github.com/elastic/kibana/pull/85706/files it's additional telemetry for resolve from #112025. So resolve is causing both (2) and (3).

POST /.kibana_8.5.3/_update/core-usage-stats%3Acore-usage-stats?refresh=false&require_alias=true
{"script":{"source":"\n              for (int i = 0; i < params.counterFieldNames.length; i++) {\n                def counterFieldName = params.counterFieldNames[i];\n                def count = params.counts[i];\n\n                if (ctx._source[params.type][counterFieldName] == null) {\n                  ctx._source[params.type][counterFieldName] = count;\n                }\n                else {\n                  ctx._source[params.type][counterFieldName] += count;\n                }\n              }\n              ctx._source.updated_at = params.time;\n            ","lang":"painless","params":{"counts":[1,1],"counterFieldNames":["savedObjectsRepository.resolvedOutcome.exactMatch","savedObjectsRepository.resolvedOutcome.total"],"time":"2023-02-08T13:18:24.782Z","type":"core-usage-stats"}},"upsert":{"core-usage-stats":{"savedObjectsRepository.resolvedOutcome.exactMatch":1,"savedObjectsRepository.resolvedOutcome.total":1},"type":"core-usage-stats","migrationVersion":{"core-usage-stats":"7.14.1"},"coreMigrationVersion":"8.5.3","updated_at":"2023-02-08T13:18:24.782Z"},"_source":true}

rudolf · 2023-02-17T22:29:38Z

In my legend explanation I left out the MGET's that retrieve one data view at a time.

If we ignore _field_caps requests there are 120+3113+451+7203=3664 data view related requests. As a worst case scenario I'm assuming we don't share a data view cache between data view telemetry and maps, so each of these collectors refresh all the data views (even though the numbers suggest this is not true).

If we get batches of 1000 data views it means to refresh 1600 data views would cost 2 batches:
1 x GET config, 2 x MGET data views batch

So worst case assuming a cold cache for both collectors we'd reduce the 3884 requests to 6 requests.

In the worst case cold-cache scenario there would still be an additional 3200 _field_caps requests from data views + maps. And here batching might not be as useful as it could easily cause a very large response payload when there's many fields which comes with it's own problems.

rudolf · 2023-02-17T22:43:36Z

What is it about making requests thats CPU or memory intensive?

Because we use HTTPS, requests are fairly CPU intensive. Each requests also requires a file descriptor for it's socket and other memory to keep the state of the SSL connection. So in my experience any time Kibana does a sudden spike of > 1000 requests we'll block the event loop. Once established the CPU overhead is minimal, so ramping up 1-1000 connections over say 30s still consumes memory but doesn't impact the event loop as much.

This does not mean SSL is the only bottleneck. It could very well be that the processing on the data views themselves are also contributing. I can show you how to profile this to accurately identify the CPU blockers.

Specifically for _field_caps requests, before spending too much time trying to optimise it, I think it's worth validating the business value of this metric. Are we really making better decisions because we know this data? Isn't this already collected in ES data collection?

If we do need this data then we will have to find a way to not collect it during usage collection but more asynchronously. One way would be a background task, but I also wonder if we want to refresh _field_caps for all data views every day. Can we only collect the field_caps for data views when they are loaded by a user and then store the counts on the data view saved object? Or maybe we can only collect field_caps for data views used in the last 24 hours, or the 100 most recently used data views?

mattkime · 2023-02-18T04:17:02Z

It could very well be that the processing on the data views themselves are also contributing. I can show you how to profile this to accurately identify the CPU blockers.

Thats very well worth verifying.

Are we really making better decisions because we know this data?

This is a question for @nreese as I think the data view telemetry could skip loading fields.

Isn't this already collected in ES data collection?

I guess it might be. Where can I find info on this? I didn't even know it existed.

If we do need this data then we will have to find a way to not collect it during usage collection but more asynchronously.

My idea is to produce a new index which roughly has documents 1:1 with data views. This way we should be able to get the info we need with a query, but hopefully we can avoid this.

nreese · 2023-02-19T15:42:21Z

Are we really making better decisions because we know this data?

That is a question for product. From our team's perspective it is helpful to know how many clusters have geospatial data and don't use maps.

In the worst case cold-cache scenario there would still be an additional 3200 _field_caps requests from data views + maps. And here batching might not be as useful as it could easily cause a very large response payload when there's many fields which comes with it's own problems.

In the maps use case, _field_caps request could be reduced by fetching geo_point and geo_shape fields for multiple data views in a single request. The response size should stay limited because of the field type filter.

sophiec20 · 2023-02-20T10:07:25Z

There is a lot of discussion regarding optimizing telemetry collection - yes, it is right that all telemetry collection should be efficient.

Should we collect this in the first place - yes, it is right we should only collect telemetry that is required to improve our service.

However, more importantly, do we have guardrails and circuit breakers for this collection? There is always a cluster out there pushing the bounds of object creation. Therefore, it is always better to drop telemetry data, than to let this collection process run at all costs. Fixing individual telemetry collectors is a whack-a-mole approach. Do we have overall breakers?

afharo · 2023-02-20T12:39:06Z

Digging through maps' collector code, it looks like https://github.com/elastic/kibana/pull/147825/files#diff-956293b485c600c77352026d223c498055f37cf7bf78ad45bf6808795dcfa2fcR882 might reduce a good amount of pressure 🎉

IIUC, that API should cache the fields and reduce many requests, wouldn't it?

However, more importantly, do we have guardrails and circuit breakers for this collection? There is always a cluster out there pushing the bounds of object creation. Therefore, it is always better to drop telemetry data, than to let this collection process run at all costs. Fixing individual telemetry collectors is a whack-a-mole approach. Do we have overall breakers?

FWIW, we are looking for options to circuit-break those rogue collectors when possible. All the solutions we've found so far tend to have workarounds that the usage collectors could exploit:

Make it slow: Telemetry is reported once per day. We could generate it slowly and ship it when ready. We could even run one collector at a time. However, it wouldn't solve an issue like one collector making 3-5k concurrent requests.
Limiting the concurrency of requests: Usage Collectors receive a SO client and an ES client. We could tweak those to limit the number of concurrent requests (and even throttle abusers). However, many collectors use their externally injected clients (different reasons may apply here). maps is an example.
- We've identified that we could leverage the Execution Context to determine which requests are used for telemetry collection (even for custom clients). However, applying throttles based on that may require a level of effort high enough to consider other options first.
Cancelling collectors that generate too much load: We could apply timeouts to the collectors (or, somehow, measure their impact on the event loop delay) and cancel their promises when they exceed a limit. However, AFAIK, unless those promises have a proper abort signal, they'll still consume resources until fulfilled.

TBH, I think it's important to keep in mind that the Telemetry Snapshot is reported once daily. For this reason, we tend to rely a lot on caching (if the report is generated but we fail to ship it, we should reuse the generated report). While it's not ideal that there's a daily spike that could momentarily freeze Kibana in some scenarios, I think collecting metrics and improving those over time is the way to go.

AFAIK the primary cause of SDHs is that external actors are abusing the telemetry endpoints to generate the report (with refreshCache: true). The last known case was another product trying to fetch the opt-in status of Kibana by fetching the entire report and looking at telemetry.optIn: true|false. Another previously known use case used to be ESS fetching the entire report to understand the usage of reporting. IMO, circuit breakers could be applied to "hide" the APIs from external actors. In the past, we did #96538, and we are close to resolving #150429.

Having said that, I'll keep thinking about ways to circuit-break any rogue collectors that might go out of hand.

jsanz · 2023-03-09T16:31:19Z

Maps application has been updated to remove data view telemetry #152124

mattkime · 2023-03-23T04:16:29Z

#152298 greatly reduces this problem. merged as of 8.7

thomasneirynck · 2023-12-07T18:47:46Z

@rudolf Is this still relevant after #151064 (comment) ?

rudolf · 2023-12-18T14:47:21Z

Yeah I think with #151064 and the performance tests from #151110 it's safe to close this.

nreese mentioned this issue Feb 13, 2023

[maps] include vector tile layers in geoShapeAggLayersCount telemetry #151072

Merged

This was referenced Feb 15, 2023

[data views] telemetry is too resource intensive, skip field loading #151248

Closed

[data views] large field caps responses can block main thread, reducing responsiveness #151249

Closed

jsanz removed the Team:Presentation Presentation Team for Dashboard, Input Controls, and Canvas label Mar 9, 2023

kertal removed the Team:DataDiscovery Discover, search (e.g. data plugin and KQL), data views, saved searches. For ES|QL, use Team:ES|QL. label Mar 23, 2023

rudolf closed this as completed Dec 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data views usage collection causes large amount of Elasticsearch queries #151064

Data views usage collection causes large amount of Elasticsearch queries #151064

rudolf commented Feb 13, 2023

elasticmachine commented Feb 13, 2023

elasticmachine commented Feb 13, 2023

elasticmachine commented Feb 13, 2023

nreese commented Feb 13, 2023

mattkime commented Feb 14, 2023

rudolf commented Feb 16, 2023

nreese commented Feb 16, 2023

rudolf commented Feb 16, 2023

mattkime commented Feb 17, 2023

rudolf commented Feb 17, 2023

rudolf commented Feb 17, 2023

rudolf commented Feb 17, 2023 •

edited

Loading

mattkime commented Feb 18, 2023 •

edited

Loading

nreese commented Feb 19, 2023

sophiec20 commented Feb 20, 2023 •

edited

Loading

afharo commented Feb 20, 2023

jsanz commented Mar 9, 2023

mattkime commented Mar 23, 2023

thomasneirynck commented Dec 7, 2023

rudolf commented Dec 18, 2023

Data views usage collection causes large amount of Elasticsearch queries #151064

Data views usage collection causes large amount of Elasticsearch queries #151064

Comments

rudolf commented Feb 13, 2023

elasticmachine commented Feb 13, 2023

elasticmachine commented Feb 13, 2023

elasticmachine commented Feb 13, 2023

nreese commented Feb 13, 2023

mattkime commented Feb 14, 2023

rudolf commented Feb 16, 2023

nreese commented Feb 16, 2023

rudolf commented Feb 16, 2023

mattkime commented Feb 17, 2023

rudolf commented Feb 17, 2023

rudolf commented Feb 17, 2023

rudolf commented Feb 17, 2023 • edited Loading

mattkime commented Feb 18, 2023 • edited Loading

nreese commented Feb 19, 2023

sophiec20 commented Feb 20, 2023 • edited Loading

afharo commented Feb 20, 2023

jsanz commented Mar 9, 2023

mattkime commented Mar 23, 2023

thomasneirynck commented Dec 7, 2023

rudolf commented Dec 18, 2023

rudolf commented Feb 17, 2023 •

edited

Loading

mattkime commented Feb 18, 2023 •

edited

Loading

sophiec20 commented Feb 20, 2023 •

edited

Loading