[Telemetry] Add scalability tests for known bottlenecks #151110
Conversation
I've added a few API scalability tests and the results are a bit mixed:
- When caching is ON, this branch performs better than `main`.
- However, when caching is OFF, `main` performs better: we see a lower rate of timeouts (this branch makes the response slower) and Kibana releases the resources of each request earlier.
I need to jump on another task... I'll get back to this when done.
```json
{
  "action": "rampUsersPerSec",
  "minUsersCount": 1,
  "maxUsersCount": 10,
  "duration": "120s"
}
```
Low number of users, as we don't expect this API to be called with `refreshCache: true` too often.
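For reference, a minimal sketch of what such a cache-refreshing call might look like (the request body shape is an assumption based on this discussion, not taken from the PR):

```ts
// Hypothetical sketch: force a telemetry cache refresh via the stats API.
// The body fields below are assumptions based on this discussion.
const res = await fetch('http://localhost:5601/api/telemetry/v2/clusters/_stats', {
  method: 'POST',
  headers: {
    'kbn-xsrf': 'true', // Kibana rejects state-changing requests without this header
    'content-type': 'application/json',
  },
  body: JSON.stringify({ unencrypted: false, refreshCache: true }),
});
const stats = await res.json();
```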
Even with that, we are testing 690 requests in around 4 minutes (2 minutes of execution + 120s of response time). That's many more requests than we expect (in the non-cached scenario).
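As a rough sanity check on that figure: a linear ramp from 1 to 10 users/sec over 120s injects roughly the average arrival rate times the duration (a back-of-the-envelope sketch, assuming the ramp is linear):

```ts
// Back-of-the-envelope request count for a linear rampUsersPerSec profile:
// average arrival rate multiplied by the ramp duration.
const minUsersPerSec = 1;
const maxUsersPerSec = 10;
const durationSec = 120;

const approxRequests = ((minUsersPerSec + maxUsersPerSec) / 2) * durationSec;
console.log(approxRequests); // 660, in the same ballpark as the ~690 observed
```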
Interesting case: what load do we expect in the non-cached scenario? How frequent is it?
Since these tests target capacity checks, it is expected that they push the request count up within a reasonable time.
FWIW, we only request the telemetry payload after checking that it hasn't been reported in the last 24h.
That means we should expect, at most, 1 daily peak of requests (where all users active at that moment may hit the endpoint).
On top of that, that's only true for the cached request. The non-cached one is only used by admins who want to audit what we send about them, so my expectation is 2-3 requests tops.
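As a minimal sketch of that once-per-24h gating (all names here are hypothetical, not the actual telemetry plugin code):

```ts
// Hypothetical sketch of the once-per-24h gating described above: the payload
// is only requested when the last successful report is older than a day.
const REPORT_INTERVAL_MS = 24 * 60 * 60 * 1000;

function shouldRequestTelemetry(lastReportedAt?: number, now = Date.now()): boolean {
  return lastReportedAt === undefined || now - lastReportedAt >= REPORT_INTERVAL_MS;
}
```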
```json
"journeyName": "POST /api/telemetry/v2/clusters/_stats - no cache - 1600 dataviews",
```
This use case struggles even with such a low number of users. If Maps improves their collector, this should get better.
The buildkite/kibana-apis-capacity-testing pipeline failed, and I'm also seeing errors while running locally. I think the test is killing Kibana, and Gatling then fails to unload the kbn-archive :(
I'm going to split the concurrency limits and the new tests. I think @mattkime might appreciate having these tests available.
Pinging @elastic/kibana-core (Team:Core)
@elasticmachine merge upstream
#151626 might improve the metrics observed in these new tests, but we might still want to improve how we handle fetching Data Views.
```json
"journeyName": "POST /api/telemetry/v2/clusters/_stats - no cache - 1600 dataviews",
"scalabilitySetup": {
```
Based on local results, I think we need to adjust the thresholds we track:

```json
"responseTimeThreshold": {
  "threshold1": 10000,
  "threshold2": 20000,
  "threshold3": 30000
},
```

Not sure if response time matters here; we can increase the values just to understand how many requests were completed.
++ to increasing the timeouts... I'm concerned about the `Connection refused` failures, though... shouldn't those be timeouts instead?
Timeouts increased in elastic/kibana@98c1f84 (#151110)
@afharo I think this test basically kills the Kibana server, and that's why we see `Connection refused`.
I can see multiple errors like:
```
[2023-02-21T18:51:35.148+01:00][ERROR][plugins.dataViews.dataView.indexPatterns] ConnectionError: connect EADDRNOTAVAIL 127.0.0.1:9220 - Local (0.0.0.0:0)
    at KibanaTransport.request (/Users/dmle/github/kibana/node_modules/@elastic/transport/src/Transport.ts:585:17)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at KibanaTransport.request (create_transport.ts:57:17)
    at ClientTraced.FieldCapsApi [as fieldCaps] (/Users/dmle/github/kibana/node_modules/@elastic/elasticsearch/src/api/api/field_caps.ts:78:10)
    at callFieldCapsApi (es_api.ts:74:12)
    at getFieldCapabilities (field_capabilities.ts:46:23)
    at IndexPatternsFetcher.getFieldsForWildcard (index_patterns_fetcher.ts:76:31)
    at IndexPatternsApiServer.getFieldsForWildcard (index_patterns_api_client.ts:33:12)
    at DataViewsService.getFieldsAndIndicesForWildcard (data_views.ts:541:12)
    at DataViewsService.refreshFieldSpecMap (data_views.ts:624:46)
    at DataViewsService.initFromSavedObjectLoadFields (data_views.ts:756:33)
    at DataViewsService.initFromSavedObject (data_views.ts:798:34)
```
and
```
[2023-02-16T16:01:15.143+01:00][ERROR][http] NoLivingConnectionsError: There are no living connections
    at KibanaTransport.request (/Users/dmle/github/kibana/node_modules/@elastic/transport/src/Transport.ts:456:17)
    at KibanaTransport.request (/Users/dmle/github/kibana/node_modules/elastic-apm-node/lib/instrumentation/modules/@elastic/elasticsearch.js:143:28)
    at KibanaTransport.request (create_transport.ts:57:29)
    at Security.hasPrivileges (/Users/dmle/github/kibana/node_modules/@elastic/elasticsearch/src/api/api/security.ts:962:33)
    at checkPrivilegesAtResources (check_privileges.ts:121:81)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at Object.atSpaces (check_privileges.ts:219:16)
    at SavedObjectsSecurityExtension.checkSavedObjectsPrivileges [as checkPrivilegesFunc] (check_saved_objects_privileges.ts:52:20)
    at SavedObjectsSecurityExtension.checkPrivileges (saved_objects_security_extension.ts:612:14)
    at SavedObjectsSecurityExtension.checkAuthorization (saved_objects_security_extension.ts:411:45)
    at SavedObjectsSecurityExtension.authorize (saved_objects_security_extension.ts:578:54)
    at SavedObjectsSecurityExtension.authorizeGet (saved_objects_security_extension.ts:822:12)
    at SavedObjectsRepository.get (repository.ts:1704:33)
    at SavedObjectsClient.get (saved_objects_client.ts:117:12)
    at UiSettingsClient.read (ui_settings_client_common.ts:158:20)
    at UiSettingsClient.getUserProvided (ui_settings_client_common.ts:53:59)
    at UiSettingsClient.getAll (base_ui_settings_client.ts:56:26)
    at Object.fieldFormatServiceFactory (plugin.ts:44:31)
    at getVisData (get_vis_data.ts:35:30)
    at vis.ts:36:23
    at Router.handle (router.ts:192:30)
    at handler (router.ts:147:13)
    at exports.Manager.execute (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/toolkit.js:60:28)
    at Object.internals.handler (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/handler.js:46:20)
    at exports.execute (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/handler.js:31:20)
    at Request._lifecycle (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/request.js:371:32)
    at Request._execute (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/request.js:281:9)
```
We covered this on Zoom. The intention of this test is to show that this API is problematic in this scenario. It will help us track improvements from the different efforts that have branched off since we identified this bottleneck.
Is it a problem to have reports that show failures?
Assigning to @dmlemeshko because he wants to run some additional tests to fine-tune the thresholds 😇
💚 Build Succeeded
Summary
We've identified that, for large deployments, some collectors take a big toll on Kibana's performance (the event loop delay spikes dramatically).
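For context, Node.js exposes the event loop delay through `perf_hooks`; a minimal sketch of how it can be observed while a heavy collector runs:

```ts
import { monitorEventLoopDelay } from 'node:perf_hooks';

// Sample the event loop delay for 10 seconds; heavy synchronous work in a
// collector will show up as a high p99 here.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setTimeout(() => {
  histogram.disable();
  // The histogram reports values in nanoseconds.
  console.log(`p99 event loop delay: ${histogram.percentile(99) / 1e6} ms`);
}, 10_000);
```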
This PR adds some scalability tests to measure any future improvements in this area:
Run

```
node scripts/run_scalability.js --journey-path x-pack/test/scalability/apis/api.telemetry.cluster_stats.no_cache.1600_dataviews.json
```

to test whether your changes improve the current behaviour.

Adding @dmlemeshko as a reviewer to validate that I'm reading the data correctly 😅