Data views usage collection causes large amount of Elasticsearch queries #151064
Data views usage collection causes several Elasticsearch queries to be run per data view. When a cluster has a large number of data views, this can cause a spike in connections to Elasticsearch, a large increase in memory consumption, and event loop delays, ultimately leading to degraded performance. In one example, 1618 data views generated 4689 Elasticsearch queries.

These requests include capturing new telemetry from the saved objects `resolve` API (#112025) and `core-usage-stats` (#85706), both introduced by sharing to multiple spaces (#113743). The impact is further exacerbated by the fact that the maps plugin also iterates over all data views for its usage collection.

Comments
Pinging @elastic/kibana-core (Team:Core)
Pinging @elastic/kibana-presentation (Team:Presentation)
Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)
elastic/elasticsearch#83636 updated the Elasticsearch field caps endpoint to allow filtering by field type. The Maps plugin could explore using a single Elasticsearch field caps request to find geo_point and geo_shape indices instead of iterating over each data view. For example, maps telemetry could just call …

Questions: …
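As a rough illustration (not the actual maps telemetry code), a single type-filtered field caps call over all indices could replace the per-data-view loop. This sketch assumes an Elasticsearch 8.2+ cluster and a JS client recent enough to expose the `types` and `filters` parameters:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// One cluster-wide request instead of one _field_caps call per data view.
// The `types` filter (added in elastic/elasticsearch#83636) keeps the
// response small: only geo_point/geo_shape fields are returned.
async function hasGeoFields(): Promise<boolean> {
  const resp = await client.fieldCaps({
    index: '*',
    fields: '*',
    types: ['geo_point', 'geo_shape'],
    filters: '-metadata', // also skip metadata fields such as _id
  });
  // With the type filter applied, any surviving field implies the cluster
  // contains geospatial data.
  return Object.keys(resp.fields).length > 0;
}
```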
[maps] include vector tile layers in geoShapeAggLayersCount telemetry (#151072)

While investigating #151064, I found a problem with IndexPatternStatsCollector where geo_shape aggregation usage with vector tile layers is not counted.

Steps to view the problem:
* Download [world countries geojson](https://vector.maps.elastic.co/files/world_countries_v7.geo.json?elastic_tile_service_tos=agree&my_app_name=ems-landing-page&my_app_version=8.6.0&license=643c1faf-80fc-4ab0-9323-4d9bd11f4bbc)
* Use file upload to upload world countries into your Elastic stack
* Add a new cluster layer to your map:
  * Select the world countries index
  * Select **Hexagons**
  * Click **Add layer**
  * Save the map
* Open browser dev tools and switch to the network tab
* Open Kibana dev tools and run
  ```
  POST kbn:api/telemetry/v2/clusters/_stats
  { "unencrypted": true }
  ```
* Copy the response for the `_stats` request and search for `geoShapeAggLayersCount`. Notice how the value is zero when it should be one, since you have one map using a geo shape aggregation.

![Screen Shot 2023-02-13 at 1 14 34 PM](https://user-images.githubusercontent.com/373691/218565153-0060dd4b-e422-477f-8b07-9f4dabd73064.png)

The PR resolves the problem by removing the layer type guard. The guard is error prone and easy to forget to update with new layer types, and it does not provide any value: the logic is really concerned with source types, and the source type guards provide the correct protections.

Steps to test: follow the steps above and verify `geoShapeAggLayersCount` is one.
[maps] include vector tile layers in geoShapeAggLayersCount telemetry (#151072) (#151089)

Backport of #151072 (commit 9d7e109) from `main` to `8.7`.

Co-authored-by: Nathan Reese <[email protected]>
In #151110, @afharo was able to reproduce the problem even with empty fields, which validates my theory: the problem we observed is caused by a spike in network connections, not by the payload size of those connections. A large payload would obviously make this even worse, but as a start I think we should try to optimise the number of outgoing requests.

I "manually" classified the requests using a filter aggregation. These are the requests caused by data views telemetry for a single telemetry usage collection run: …

Maps telemetry causes an identical distribution of data-view-related requests, but this time nearly double the amount: …

Explanation of the "legend": …
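For reference, a sketch of this kind of classification, assuming the requests were captured into a hypothetical `kibana-request-logs` index with a `url.path` field (the index and field names are illustrative, not from the issue):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Bucket captured requests by endpoint using a named `filters` aggregation.
async function classifyRequests() {
  const resp = await client.search({
    index: 'kibana-request-logs', // hypothetical index holding captured request logs
    size: 0,
    aggs: {
      requests: {
        filters: {
          filters: {
            field_caps: { wildcard: { 'url.path': { value: '*_field_caps*' } } },
            mget: { wildcard: { 'url.path': { value: '*_mget*' } } },
            saved_objects: { wildcard: { 'url.path': { value: '*.kibana*' } } },
          },
        },
      },
    },
  });
  // Each named bucket reports a doc_count, e.g. the number of _field_caps calls.
  return (resp.aggregations as any).requests.buckets;
}
```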
I can't really make sense of these numbers; it seems like we do one of each, but why … By doing a breakdown by … Perhaps at different times different amounts of the data views are already cached? Or perhaps the server gets overloaded and there's a timeout before all the requests are able to complete? I can't really explain the differences here.

But fundamentally there's no batching: each index pattern is loaded individually. If we loaded batches of 100, we'd probably have 100× fewer requests of types (1), (2), and (3). I'm not sure if we want to batch _field_caps requests, as that could produce a large response payload.
Maps telemetry is calculating the total number of indices with geospatial fields. Batching _field_caps requests with field type filtering for geo_point and geo_shape would work really well for this use case.
Something I don't yet understand: https://github.com/elastic/kibana/pull/85706/files added telemetry to the HTTP saved objects API, so internal calls through the server-side saved objects client shouldn't cause this.
If I'm understanding this correctly: …

Finally, using … Overall, it seems that we can cut the number of requests in half. Is this good enough? I don't think it is. Looking at the SDH, I can see extended periods where Kibana was impacted, perhaps a couple of hours. Is there other telemetry collection code that fires off a lot of requests? Do we need to take a wider view?
What is it about making requests that's CPU- or memory-intensive? Am I making too much of the spikes in resource utilization in the SDH?
I had a look at the payload, and this is not actually core-usage-data from the HTTP APIs (https://github.com/elastic/kibana/pull/85706/files); it's additional telemetry for …
In my legend explanation I left out the MGETs that retrieve one data view at a time. If we ignore _field_caps requests, there are 120 + 3113 + 451 + 3519 = 7203 data-view-related requests. As a worst-case scenario, I'm assuming we don't share a data view cache between data views telemetry and maps, so each of these collectors refreshes all the data views (even though the numbers suggest this is not true). If we get batches of 1000 data views, refreshing 1600 data views would cost 2 batches: …

So, worst case, assuming a cold cache for both collectors, we'd reduce these 7203 requests to 6 requests. In the worst-case cold-cache scenario there would still be an additional 3200 _field_caps requests from data views + maps, and here batching might not be as useful, since it could easily cause a very large response payload when there are many fields, which comes with its own problems.
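A minimal sketch of what batched loading could look like, assuming access to a server-side saved objects client (the import path and type names follow current Kibana conventions but are illustrative, not the actual fix):

```ts
import type { SavedObjectsClientContract } from '@kbn/core/server';

// Load all data views in pages of 1000 instead of one mget per data view:
// 1600 data views cost 2 `find` round trips rather than 1600 fetches.
async function loadAllDataViews(soClient: SavedObjectsClientContract) {
  const perPage = 1000;
  const results: unknown[] = [];
  let page = 1;
  while (true) {
    const { saved_objects: batch, total } = await soClient.find({
      type: 'index-pattern',
      perPage,
      page,
    });
    results.push(...batch);
    if (page * perPage >= total) break;
    page += 1;
  }
  return results;
}
```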
Because we use HTTPS, requests are fairly CPU intensive. Each request also requires a file descriptor for its socket, plus memory to keep the state of the SSL connection. So in my experience, any time Kibana does a sudden spike of > 1000 requests, we'll block the event loop. Once established, the CPU overhead is minimal, so ramping up from 1 to 1000 connections over, say, 30s still consumes memory but doesn't impact the event loop as much. This does not mean SSL is the only bottleneck; it could very well be that the processing of the data views themselves also contributes. I can show you how to profile this to accurately identify the CPU blockers.

Specifically for _field_caps requests: before spending too much time trying to optimise them, I think it's worth validating the business value of this metric. Are we really making better decisions because we know this data? Isn't this already collected in ES data collection? If we do need this data, then we will have to find a way to collect it not during usage collection but more asynchronously. One way would be a background task, but I also wonder whether we want to refresh _field_caps for all data views every day. Could we collect field_caps only when a data view is loaded by a user, and then store the counts on the data view saved object? Or collect field_caps only for data views used in the last 24 hours, or for the 100 most recently used data views?
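On the profiling point, a small sketch (plain Node, nothing Kibana-specific) that measures event-loop delay around a suspect collection run using `perf_hooks`:

```ts
import { monitorEventLoopDelay } from 'perf_hooks';

// Sample event-loop delay while the suspect code runs; sustained p99/max
// spikes indicate the loop is being blocked (raw values are nanoseconds).
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// ... trigger the usage collection / request spike under test here ...

setTimeout(() => {
  histogram.disable();
  console.log({
    meanMs: histogram.mean / 1e6,
    p99Ms: histogram.percentile(99) / 1e6,
    maxMs: histogram.max / 1e6,
  });
}, 30_000);
```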
That's very well worth verifying.
This is a question for @nreese as I think the data view telemetry could skip loading fields.
I guess it might be. Where can I find info on this? I didn't even know it existed.
My idea is to produce a new index whose documents map roughly 1:1 to data views. That way we should be able to get the info we need with a single query, but hopefully we can avoid this.
That is a question for product. From our team's perspective it is helpful to know how many clusters have geospatial data and don't use maps.
In the maps use case, _field_caps requests could be reduced by fetching geo_point and geo_shape fields for multiple data views in a single request. The response size should stay limited because of the field type filter.
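A sketch of that idea, under the same client assumptions as the earlier field caps example; the index patterns here are placeholders standing in for the titles of several data views:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Placeholder titles; in practice these come from the data view saved objects.
const dataViewTitles = ['logs-*', 'world-countries', 'flights'];

// One request covering every data view: the index target accepts a
// comma-separated list of patterns, and the `types` filter keeps the
// response limited to geo fields even across many data views.
async function geoFieldCaps(titles: string[]) {
  return client.fieldCaps({
    index: titles.join(','),
    fields: '*',
    types: ['geo_point', 'geo_shape'],
  });
}

// Usage: const caps = await geoFieldCaps(dataViewTitles);
```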
There is a lot of discussion here about optimizing telemetry collection. Yes, it is right that all telemetry collection should be efficient. Should we collect this in the first place? Yes, it is right that we should only collect telemetry that is required to improve our service. More importantly, though: do we have guardrails and circuit breakers for this collection? There is always a cluster out there pushing the bounds of object creation, so it is always better to drop telemetry data than to let the collection process run at all costs. Fixing individual telemetry collectors is a whack-a-mole approach. Do we have overall breakers?
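One possible shape for such a breaker (a sketch, not an existing Kibana API): give each collector a time budget and drop its payload when the budget is exceeded. `fetchWithBudget` and `fetchUsage` are hypothetical names; `fetchUsage` stands in for any collector's fetch method:

```ts
// Drop a slow collector's stats instead of letting it run at all costs.
async function fetchWithBudget<T>(
  fetchUsage: () => Promise<T>,
  budgetMs = 60_000
): Promise<T | undefined> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<undefined>((resolve) => {
    timer = setTimeout(() => resolve(undefined), budgetMs);
  });
  try {
    // Whichever settles first wins; on timeout the caller simply omits
    // this collector from the report.
    return await Promise.race([fetchUsage(), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Note that `Promise.race` only abandons the result; the underlying requests keep running, which is exactly the kind of workaround a rogue collector could exploit, as noted in a later comment.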
Digging through the maps collector code, it looks like https://github.com/elastic/kibana/pull/147825/files#diff-956293b485c600c77352026d223c498055f37cf7bf78ad45bf6808795dcfa2fcR882 might relieve a good amount of pressure 🎉 IIUC, that API should cache the fields and avoid many requests, shouldn't it?
FWIW, we are looking for options to circuit-break those rogue collectors when possible. All the solutions we've found so far tend to have workarounds that the usage collectors could exploit: …
TBH, I think it's important to keep in mind that the telemetry snapshot is reported once daily. For this reason, we tend to rely a lot on caching (if the report is generated but we fail to ship it, we should reuse the generated report). While it's not ideal that there's a daily spike that could momentarily freeze Kibana in some scenarios, I think collecting metrics and improving those over time is the way to go. AFAIK the primary cause of SDHs is that external actors are abusing the telemetry endpoints to generate the report (with …). Having said that, I'll keep thinking about ways to circuit-break any rogue collectors that might go out of hand.
The Maps application has been updated to remove data view telemetry: #152124
#152298 greatly reduces this problem. Merged as of 8.7.
@rudolf Is this still relevant after #151064 (comment)?