[ML] APM Correlations: Fix usage in load balancing/HA setups. #115145

walterra · 2021-10-15T07:07:29Z

Summary

The way we customized the use of search strategies caused issues with race conditions when multiple Kibana instances were used for load balancing. This PR migrates away from search strategies and uses regular APM API endpoints.
The task that manages calling the sequence of queries to run the correlations analysis is now in a custom React hook (useFailedTransactionsCorrelations / useLatencyCorrelations) instead of a task on the Kibana server side. While they show up as new lines/files in the git diff, the code for the hooks is more or less a combination of the previous useSearchStrategy and the server side service files that managed queries and state.
The consuming React UI components only needed minimal changes. The above mentioned hooks return the same data structure as the previously used useSearchStrategy. This also means functional UI tests didn't need any changes and should pass as is.
API integration tests have been added for the individual new endpoints. The test files that were previously used for the search strategies are still there to simulate a full analysis run. The assertions for the resulting data have the same values; only the structure had to be adapted.
Previously all ES queries of the analysis were run sequentially. The new endpoints run ES queries in parallel where possible. Chunking is managed in the hooks on the client side.
For now the endpoints use the standard current user's esClient. I tried to use the APM client, but it was missing a wrapper for the fieldCaps method and I ran into a problem when trying to construct a random_score query. Sticking to the esClient allowed us to leave most of the functions that run the actual queries unchanged. If possible I'd like to pick this up in a follow-up. All the endpoints still use withApmSpan() now though.
The previous use of generators was also refactored away; as mentioned above, the queries are now run in parallel.
Because we might run up to hundreds of similar requests for correlation analysis, we don't want the analysis to fail if just a single query fails, as it did in the previous search-strategy-based task. I created a util splitAllSettledPromises() to handle Promise.allSettled() and split the results and errors to make the handling easier. Better naming suggestions are welcome 😅 . A future improvement could be to not run individual queries but combine them into nested aggs or use msearch. That's out of scope for this PR though.
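
As a rough illustration of the idea behind splitAllSettledPromises(), here is a minimal sketch. It assumes the helper receives the array returned by Promise.allSettled(); the return shape (`fulfilled` / `rejected`), the `SplitResult` type and the `runQueries` wrapper are assumptions made for this example and are not taken from the PR's code.

```typescript
// Illustrative sketch only: the return shape and the names SplitResult and
// runQueries are assumptions, not the implementation from this PR.
interface SplitResult<T> {
  fulfilled: T[];
  rejected: unknown[];
}

// Splits the output of Promise.allSettled() into resolved values and rejection reasons.
function splitAllSettledPromises<T>(
  settled: ReadonlyArray<PromiseSettledResult<T>>
): SplitResult<T> {
  const result: SplitResult<T> = { fulfilled: [], rejected: [] };
  for (const item of settled) {
    if (item.status === 'fulfilled') {
      result.fulfilled.push(item.value);
    } else {
      result.rejected.push(item.reason);
    }
  }
  return result;
}

// Hypothetical caller: run many similar queries and keep the ones that succeed,
// so a single failing query does not fail the whole analysis.
async function runQueries<T>(
  queries: Array<() => Promise<T>>
): Promise<SplitResult<T>> {
  const settled = await Promise.allSettled(queries.map((query) => query()));
  return splitAllSettledPromises(settled);
}
```

With a shape like this, a caller could continue with `fulfilled` and report `rejected` separately instead of aborting on the first error.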

Checklist

Unit or functional tests were updated or added to match the most common scenarios
This was checked for breaking API changes and was labeled appropriately

dgieselaar · 2021-11-05T12:16:41Z

My feedback mostly revolves around parallelising some of the requests, which can hopefully improve performance. Long term I think we should move this to one or two API calls and remove progressive loading; I don't think it's worth the complexity.

kibanamachine · 2021-11-08T15:10:49Z

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id	before	after	diff
`apm`	1184	1189	+5

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`apm`	2.7MB	2.7MB	+4.0KB

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run `node scripts/build_api_docs --plugin [yourplugin] --stats exports` for more detailed information.

id	before	after	diff
`apm`	37	41	+4

History

💔 Build #4943 failed 22f785c
💔 Build #4910 failed de63d24
💔 Build #4767 failed 86b78c1
💔 Build #4761 failed 24d1d2b
💚 Build #4618 succeeded 97890cf
💚 Build #4185 succeeded 5310686

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @walterra

dgieselaar · 2021-11-08T15:42:35Z

x-pack/plugins/apm/public/components/app/correlations/use_failed_transactions_correlations.ts

+      const { fieldCandidates: candidates } = await callApmApi({
+        endpoint: 'GET /internal/apm/correlations/field_candidates',
+        signal: abortCtrl.current.signal,
+        params: {
+          query: fetchParams,
+        },
+      });


This can be parallelised as well, no?

walterra · 2021-11-08

I parallelised the first two requests to get the chart data first so we can show the chart as soon as it's available. This one is the first of the analysis requests; the requests after it depend on its output.

dgieselaar · 2021-11-08

I'm not sure if I fully understand, but there's no need (AFAICT) to delay starting the request until the data for the chart has been fully loaded. E.g. you can start the request but only await it once the data for the charts has loaded.
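
The "start the request now, await it later" suggestion could look roughly like the sketch below. The function and parameter names (`loadChartThenCandidates`, `fetchChartData`, `fetchFieldCandidates`, `onChartData`) are hypothetical stand-ins for the hook's real calls and are not part of the PR.

```typescript
// Illustrative sketch only: all names here are hypothetical stand-ins,
// not the hook code from this PR.
async function loadChartThenCandidates<C, F>(
  fetchChartData: () => Promise<C>,
  fetchFieldCandidates: () => Promise<F>,
  onChartData: (data: C) => void
): Promise<F> {
  // Kick off the field candidates request right away, without awaiting it yet.
  const candidatesPromise = fetchFieldCandidates();
  // Attach a no-op handler so an early rejection is not reported as unhandled
  // while the chart data is still loading; the error still surfaces on the await below.
  candidatesPromise.catch(() => {});

  // The chart data is still awaited first so the chart can render as early as possible.
  const chartData = await fetchChartData();
  onChartData(chartData);

  // Only now wait for the request that was started at the top.
  return await candidatesPromise;
}
```

Both requests are in flight at the same time, but the chart is still rendered before the analysis continues with the field candidates.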

dgieselaar · 2021-11-08T15:43:00Z

x-pack/plugins/apm/public/components/app/correlations/use_failed_transactions_correlations.ts

+
+      const fieldCandidatesChunks = chunk(fieldCandidates, chunkSize);
+
+      for (const fieldCandidatesChunk of fieldCandidatesChunks) {


walterra · 2021-11-08

This splits all field candidates into chunks. The chunks are called in sequence here on the client side, but all field candidates of a chunk are then queried in parallel on the Kibana server side.

dgieselaar · 2021-11-08

Yes, but why call these in sequence? How many blocking calls can we expect here?

walterra · 2021-11-08

This was made in the spirit of "make it slow". I'm sure this can be further optimized; in this PR we started to parallelize the server side calls and play it safe on the client side. Since the field candidates and field value pairs are generated dynamically, we don't want to allow an unlimited number of queries to run in parallel. Field candidates are usually in the dozens, field value pairs can be in the hundreds.

dgieselaar · 2021-11-08

I don't see how running dozens of requests sequentially is a better experience. We can use pLimit here to limit the number of concurrent requests. Something like 5 sounds like a good start.

sorenlouv · 2021-11-08

+1 for using p-limit here

walterra · 2021-11-09

p-limit sounds like a good idea worth pursuing, but can we agree to do this in a follow-up?

dgieselaar · 2021-11-09

Yes, that's fine 👍
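
To make the trade-off in this thread concrete, here is a minimal, dependency-free sketch of limiting the number of concurrent requests. It is a hand-rolled stand-in for what a library such as p-limit provides, not the p-limit API itself; the function name `mapWithConcurrency` and its signature are assumptions for this example.

```typescript
// Illustrative sketch only: a hand-rolled concurrency limiter. The name
// mapWithConcurrency and its signature are assumptions for this example.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  concurrency: number,
  task: (item: T) => Promise<R>
): Promise<Array<PromiseSettledResult<R>>> {
  const results: Array<PromiseSettledResult<R>> = new Array(items.length);
  let next = 0;

  // Each worker pulls the next unprocessed index until none are left.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { status: 'fulfilled', value: await task(items[index]) };
      } catch (reason) {
        // Record the failure and keep going, so one failing request
        // does not abort the remaining ones.
        results[index] = { status: 'rejected', reason };
      }
    }
  }

  // Never start more workers than there are items, and always at least one.
  const workerCount = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Compared with awaiting one chunk after another, a limit of 5 as suggested above would keep up to five requests in flight at any time while still capping the total load.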

dgieselaar

I'll approve this with the caveat that I haven't fully tested it - I'm mostly new to this code and we don't have much time, and this bug fix is sorely needed. Thanks @walterra!

…115145) (#117979) * [ML] APM Correlations: Fix usage in load balancing/HA setups. (#115145) * [ML] Fix http client types.

…117958) (#118074) * [kbn/io-ts] export and require importing individual functions (#117958) * [kbn/io-ts] fix direct import Conflict between #117958 and #115145 Signed-off-by: Tyler Smalley <[email protected]> Co-authored-by: Spencer <[email protected]> Co-authored-by: spalger <[email protected]> Co-authored-by: Tyler Smalley <[email protected]>

walterra self-assigned this Oct 15, 2021

walterra force-pushed the ml-apm-correlations-fix-load-balancing branch from 0255b74 to 538cf24 on October 15, 2021 13:47

walterra changed the title ~~[ML] APM Correlations: Migrate search strategy to regular endpoints~~ [ML] APM Correlations: Migrate custom search strategy to regular endpoints Oct 16, 2021

walterra changed the title ~~[ML] APM Correlations: Migrate custom search strategy to regular endpoints~~ [ML] APM Correlations: Migrate custom search strategies to regular endpoints Oct 16, 2021

walterra force-pushed the ml-apm-correlations-fix-load-balancing branch 4 times, most recently from 28eb8d9 to ece4fe0 on October 19, 2021 19:18

walterra added the bug Fixes for quality problems that affect the customer experience label Oct 20, 2021

walterra changed the title ~~[ML] APM Correlations: Migrate custom search strategies to regular endpoints~~ [ML] APM Correlations: Fix usage in load balancing/HA setups. Oct 20, 2021

walterra added the release_note:fix label Oct 20, 2021

walterra added 19 commits October 20, 2021 15:33

[ML] Move data fetching for overall latency histogram to custom hook.

e1fa9c4

[ML] Migrates field candidates to regular endpoint.

46f59d6

[ML] Fetch latency correlations via regular endpoints.

0e37684

[ML] Fetch failed transaction correlations via regular endpoints.

7d75d5f

[ML] Fix types.

37aa29b

[ML] Rename common/search_strategies to common/correlations.

414a992

[ML] Rename server/lib/search_strategies to server/lib/correlations.

83f5ae0

[ML] Fix API integration tests.

8d92d94

[ML] Remove the no longer needed 'took' attribute.

9de6ff6

[ML] Adds chunking to field value pair loading.

30d3883

[ML] Fix ccsWarning.

91e5f13

[ML] Fix jest test.

c1cf1e5

[ML] Fix item check.

72eb07d

[ML] Refactor away from using generators.

ba86ce1

[ML] Remove references to log.

3368aba

[ML] Get rid of SearchStrategy references in type/var names.

430aa12

[ML] Deduplicate some code.

aa80643

[ML] Adds debouncing to correlation analysis.

4f080bf

[ML] Fix types.

8868bb9

[ML] Use abort signal instead of isCancelledRef.

97890cf

walterra mentioned this pull request Nov 5, 2021

[APM] Correlations: Improve "Log-log plot" description #117659

Closed

walterra added 5 commits November 5, 2021 19:38

Merge branch 'main' into ml-apm-correlations-fix-load-balancing

24d1d2b

[ML] Fix test imports.

86b78c1

Merge branch 'main' into ml-apm-correlations-fix-load-balancing

de63d24

fix field value pair error handling

22f785c

[ML] Fix jest test.

9872afc

dgieselaar reviewed Nov 8, 2021

dgieselaar approved these changes Nov 9, 2021

walterra merged commit f9c982d into elastic:main Nov 9, 2021

walterra deleted the ml-apm-correlations-fix-load-balancing branch November 9, 2021 09:27

walterra mentioned this pull request Nov 9, 2021

[8.0] [ML] APM Correlations: Fix usage in load balancing/HA setups. (#115145) #117979

Merged

walterra mentioned this pull request Nov 9, 2021

[7.16] [ML] APM Correlations: Fix usage in load balancing/HA setups. (#115145) #118004

Merged

walterra mentioned this pull request Nov 9, 2021

[APM] Latency correlations 7.16 meta issue #109220

Closed

11 tasks

tylersmalley pushed a commit that referenced this pull request Nov 9, 2021

[kbn/io-ts] fix direct import

994a4e4

Conflict between #117958 and #115145 Signed-off-by: Tyler Smalley <[email protected]>

spalger pushed a commit to kibanamachine/kibana that referenced this pull request Nov 9, 2021

[kbn/io-ts] fix direct import

c2cbd7d

Conflict between elastic#117958 and elastic#115145 Signed-off-by: Tyler Smalley <[email protected]>

walterra mentioned this pull request Nov 10, 2021

[APM] Latency distribution chart fails to load with AsyncSearchService error #114046

Closed
