[Telemetry] Add scalability tests for known bottlenecks #151110
Conversation
I've added a few API scalability tests and the results are a bit mixed:
- When caching is ON, this branch performs better than `main`.
- However, when caching is OFF, `main` performs better: we see a lower rate of timeouts (this branch makes the response slower) and Kibana releases the resources of each request earlier.
I need to jump on another task... I'll get back to this when done.
```json
{
  "action": "rampUsersPerSec",
  "minUsersCount": 1,
  "maxUsersCount": 10,
  "duration": "120s"
}
```
Low number of users, as we don't expect this API to be called with `refreshCache: true` too often.
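For reference, a minimal sketch of what such a cache-refreshing call might look like (the request body shape is an assumption based on this discussion, not taken from the PR):

```ts
// Hypothetical sketch: force a telemetry cache refresh via the stats API.
// The body fields below are assumptions based on this discussion.
const res = await fetch('http://localhost:5601/api/telemetry/v2/clusters/_stats', {
  method: 'POST',
  headers: {
    'kbn-xsrf': 'true', // Kibana rejects state-changing requests without this header
    'content-type': 'application/json',
  },
  body: JSON.stringify({ unencrypted: false, refreshCache: true }),
});
const stats = await res.json();
```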
Even with that, we are testing 690 requests in around 4 minutes (2 minutes of execution + 120s of response time). That's many more requests than we expect (in the non-cached scenario).
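As a rough sanity check on that figure: a linear ramp from 1 to 10 users/sec over 120s injects roughly the average arrival rate times the duration (a back-of-the-envelope sketch, assuming the ramp is linear):

```ts
// Back-of-the-envelope request count for a linear rampUsersPerSec profile:
// average arrival rate multiplied by the ramp duration.
const minUsersPerSec = 1;
const maxUsersPerSec = 10;
const durationSec = 120;

const approxRequests = ((minUsersPerSec + maxUsersPerSec) / 2) * durationSec;
console.log(approxRequests); // 660, in the same ballpark as the ~690 observed
```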
Interesting case: what load do we expect in the non-cached scenario? How frequent is it?
Since these tests target capacity checks, it is expected that they push the request count up within a reasonable time.
FWIW, we only request the telemetry payload after checking that it hasn't been reported in the last 24h.
That means we should expect, at most, 1 daily peak of requests (where all users active at that moment may hit the endpoint).
On top of that, that's only true for the cached request. The non-cached one is only used by admins who want to audit what we send about them, so my expectation is 2-3 requests tops.
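As a minimal sketch of that once-per-24h gating (all names here are hypothetical, not the actual telemetry plugin code):

```ts
// Hypothetical sketch of the once-per-24h gating described above: the payload
// is only requested when the last successful report is older than a day.
const REPORT_INTERVAL_MS = 24 * 60 * 60 * 1000;

function shouldRequestTelemetry(lastReportedAt?: number, now = Date.now()): boolean {
  return lastReportedAt === undefined || now - lastReportedAt >= REPORT_INTERVAL_MS;
}
```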
```json
"journeyName": "POST /api/telemetry/v2/clusters/_stats - no cache - 1600 dataviews",
```
This use case struggles even with such a low number of users. If Maps improves their collector, this should get better.
The buildkite/kibana-apis-capacity-testing pipeline failed, and I'm also seeing errors while running locally. I think the test is killing Kibana, and Gatling then fails to unload the kbn-archive :(
I'm going to split the concurrency limits and the new tests. I think @mattkime might appreciate having these tests available.
Pinging @elastic/kibana-core (Team:Core)
@elasticmachine merge upstream
#151626 might improve the metrics observed in these new tests, but we might still want to improve how we handle fetching Data Views.
```json
"journeyName": "POST /api/telemetry/v2/clusters/_stats - no cache - 1600 dataviews",
"scalabilitySetup": {
```
Based on local results, I think we need to adjust the thresholds we track:

```json
"responseTimeThreshold": {
  "threshold1": 10000,
  "threshold2": 20000,
  "threshold3": 30000
},
```

Not sure if response time matters here; we can increase the values just to understand how many requests were completed.
++ to increasing the timeouts... I'm concerned about the `Connection refused` failures, though... shouldn't those be timeouts instead?
Timeouts increased in elastic/kibana@98c1f84 (#151110)
@afharo I think this test basically kills the Kibana server, and that's why we see `Connection refused`.
I can see multiple errors like:
```
[2023-02-21T18:51:35.148+01:00][ERROR][plugins.dataViews.dataView.indexPatterns] ConnectionError: connect EADDRNOTAVAIL 127.0.0.1:9220 - Local (0.0.0.0:0)
    at KibanaTransport.request (/Users/dmle/github/kibana/node_modules/@elastic/transport/src/Transport.ts:585:17)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at KibanaTransport.request (create_transport.ts:57:17)
    at ClientTraced.FieldCapsApi [as fieldCaps] (/Users/dmle/github/kibana/node_modules/@elastic/elasticsearch/src/api/api/field_caps.ts:78:10)
    at callFieldCapsApi (es_api.ts:74:12)
    at getFieldCapabilities (field_capabilities.ts:46:23)
    at IndexPatternsFetcher.getFieldsForWildcard (index_patterns_fetcher.ts:76:31)
    at IndexPatternsApiServer.getFieldsForWildcard (index_patterns_api_client.ts:33:12)
    at DataViewsService.getFieldsAndIndicesForWildcard (data_views.ts:541:12)
    at DataViewsService.refreshFieldSpecMap (data_views.ts:624:46)
    at DataViewsService.initFromSavedObjectLoadFields (data_views.ts:756:33)
    at DataViewsService.initFromSavedObject (data_views.ts:798:34)
```
and
```
[2023-02-16T16:01:15.143+01:00][ERROR][http] NoLivingConnectionsError: There are no living connections
    at KibanaTransport.request (/Users/dmle/github/kibana/node_modules/@elastic/transport/src/Transport.ts:456:17)
    at KibanaTransport.request (/Users/dmle/github/kibana/node_modules/elastic-apm-node/lib/instrumentation/modules/@elastic/elasticsearch.js:143:28)
    at KibanaTransport.request (create_transport.ts:57:29)
    at Security.hasPrivileges (/Users/dmle/github/kibana/node_modules/@elastic/elasticsearch/src/api/api/security.ts:962:33)
    at checkPrivilegesAtResources (check_privileges.ts:121:81)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at Object.atSpaces (check_privileges.ts:219:16)
    at SavedObjectsSecurityExtension.checkSavedObjectsPrivileges [as checkPrivilegesFunc] (check_saved_objects_privileges.ts:52:20)
    at SavedObjectsSecurityExtension.checkPrivileges (saved_objects_security_extension.ts:612:14)
    at SavedObjectsSecurityExtension.checkAuthorization (saved_objects_security_extension.ts:411:45)
    at SavedObjectsSecurityExtension.authorize (saved_objects_security_extension.ts:578:54)
    at SavedObjectsSecurityExtension.authorizeGet (saved_objects_security_extension.ts:822:12)
    at SavedObjectsRepository.get (repository.ts:1704:33)
    at SavedObjectsClient.get (saved_objects_client.ts:117:12)
    at UiSettingsClient.read (ui_settings_client_common.ts:158:20)
    at UiSettingsClient.getUserProvided (ui_settings_client_common.ts:53:59)
    at UiSettingsClient.getAll (base_ui_settings_client.ts:56:26)
    at Object.fieldFormatServiceFactory (plugin.ts:44:31)
    at getVisData (get_vis_data.ts:35:30)
    at vis.ts:36:23
    at Router.handle (router.ts:192:30)
    at handler (router.ts:147:13)
    at exports.Manager.execute (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/toolkit.js:60:28)
    at Object.internals.handler (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/handler.js:46:20)
    at exports.execute (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/handler.js:31:20)
    at Request._lifecycle (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/request.js:371:32)
    at Request._execute (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/request.js:281:9)
```
We covered this on Zoom. The intention of this test is to show that this API is problematic in this scenario. It will help us track improvements from the different efforts that have branched off since we identified this bottleneck.
Is it a problem to have reports that show failures?
Assigning to @dmlemeshko because he wants to run some additional tests to fine-tune the thresholds 😇
💚 Build Succeeded
Summary
We've identified that, for large deployments, some collectors take a big toll on Kibana's performance (the event loop delay spikes dramatically).
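For context, Node.js exposes the event loop delay through `perf_hooks`; a minimal sketch of how it can be observed while a heavy collector runs:

```ts
import { monitorEventLoopDelay } from 'node:perf_hooks';

// Sample the event loop delay for 10 seconds; heavy synchronous work in a
// collector will show up as a high p99 here.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setTimeout(() => {
  histogram.disable();
  // The histogram reports values in nanoseconds.
  console.log(`p99 event loop delay: ${histogram.percentile(99) / 1e6} ms`);
}, 10_000);
```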
This PR adds some scalability tests to measure any future improvements in this area:
Run

```
node scripts/run_scalability.js --journey-path x-pack/test/scalability/apis/api.telemetry.cluster_stats.no_cache.1600_dataviews.json
```

to test whether your changes improve the current behaviour.

Adding @dmlemeshko as a reviewer to validate that I'm reading the data correctly 😅