
[Alerting] Improving health status check #93282

Merged
6 commits merged into elastic:master from the alerting/health-check branch on Mar 4, 2021

Conversation

@ymao1 (Contributor) commented Mar 2, 2021

Resolves #93062

Summary

Made several changes to the alerting health check:

  1. Added a share() to the combineLatest operator that combines the core status with the alerting health status. I was seeing duplicate observable streams being created from getHealthStatusStream (88!), each firing at a 5 minute interval. Perhaps that many concurrent get requests to the task manager saved object were contributing to the 503 socket hangup errors?

  2. Moved catchError from the top-level interval observable to within the switchMap. When catchError was at the top level, it would handle the error and complete the stream, meaning that once the alerting status became unavailable, it would stop polling for an updated status and remain in an error state.

  3. Added a retryWhen operator which retries getting the status a few times before propagating the error status.

A minimal sketch of how these operators fit together follows this list.
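Not the actual diff, just a minimal RxJS sketch of the three changes; the constants, getAlertingHealthStatus, and coreStatus$ below are stand-ins for the real implementation in the alerting plugin:

import { combineLatest, defer, interval, of, throwError, timer, Observable } from 'rxjs';
import { catchError, map, mergeMap, retryWhen, share, switchMap } from 'rxjs/operators';

// Placeholder values and types for illustration only.
const MAX_RETRY_ATTEMPTS = 3;
const RETRY_DELAY = 5 * 1000;
const POLL_INTERVAL = 5 * 60 * 1000;
declare function getAlertingHealthStatus(): Promise<{ level: string; summary: string }>;
declare const coreStatus$: Observable<{ level: string; summary: string }>;

// (3) Retry the status fetch a few times before propagating the error, and
// (2) catch the error *inside* the switchMap so only this emission maps to
// "unavailable" while the outer interval keeps polling on the next tick.
const getStatusWithRetry$ = () =>
  defer(() => getAlertingHealthStatus()).pipe(
    retryWhen((errors) =>
      errors.pipe(
        mergeMap((error, i) => (i >= MAX_RETRY_ATTEMPTS ? throwError(error) : timer(RETRY_DELAY)))
      )
    ),
    catchError(() => of({ level: 'unavailable', summary: 'Alerting framework is unavailable' }))
  );

const alertingStatus$ = interval(POLL_INTERVAL).pipe(switchMap(() => getStatusWithRetry$()));

// (1) share() so every subscriber reuses a single underlying polling stream
// instead of each one spawning its own chain of task manager gets.
export const healthStatus$ = combineLatest([coreStatus$, alertingStatus$]).pipe(
  map(([core, alerting]) => ({ core, alerting })),
  share()
);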

Checklist

Delete any items that are not applicable to this PR.

@ymao1 ymao1 self-assigned this Mar 2, 2021
@ymao1 ymao1 added the Feature:Alerting, Team:ResponseOps, v7.13.0, v8.0.0, and release_note:skip labels Mar 2, 2021
@ymao1 ymao1 marked this pull request as ready for review March 2, 2021 20:20
@ymao1 ymao1 requested a review from a team as a code owner March 2, 2021 20:20
@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@ymao1 (Contributor, Author) commented Mar 2, 2021

@elasticmachine merge upstream

@YulNaumenko (Contributor) left a comment

LGTM! Awesome improvements!

for (let i = 0; i < MAX_RETRY_ATTEMPTS + 1; i++) {
await tick();
jest.advanceTimersByTime(retryDelay);
}
Review comment (Contributor):

Could we add an assertion that mockTaskManager.get was actually called MAX_RETRY_ATTEMPTS times?

Otherwise, in theory, this test won't catch anything if the stream never emits any values, since all the expects are inside the subscription handler.
(This is from experience... I've missed a bug before because I made the exact same mistake.) :)
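For example, something along these lines (a sketch, assuming mockTaskManager.get is a jest.fn() and the fake timers have already been advanced past the retries):

// The exact expected count may be MAX_RETRY_ATTEMPTS + 1 if the initial
// attempt is counted separately from the retries.
expect(mockTaskManager.get).toHaveBeenCalledTimes(MAX_RETRY_ATTEMPTS);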

Comment on lines 71 to 77
interval(pollInterval)
.pipe(
switchMap(() =>
getHealthServiceStatusWithRetryAndErrorHandling(mockTaskManager, retryDelay)
)
)
.subscribe();
Review comment (Contributor):

Is it worth using getHealthStatusStream directly here, with the interval being an argument (we can use a default value in the implementation)?

That way the unit tests ensure the composition is behaving as expected.
Just in case someone changes the switchMap in getHealthStatusStream in the future to something that behaves differently... 🤔
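Something like this, for instance (a hypothetical signature where the poll interval and retry delay fall back to the production defaults when omitted):

// Drive the real composition from the test instead of rebuilding
// interval + switchMap inline.
getHealthStatusStream(mockTaskManager, pollInterval, retryDelay).subscribe();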

@gmmorris (Contributor) left a comment

LGTM from a logical perspective.
I haven't been able to test this locally because I'm not sure how to cause the failure case.

Any advice on whether that can be tested? 🤔

@ymao1 (Contributor, Author) commented Mar 3, 2021

Any advice on whether that can be tested? 🤔

I did something like this in getLatestTaskState:

// Simulate intermittent failures: throw on the calls flagged true below.
let callCount = 0;
const shouldThrowError = [false, false, true, true, true, false, false /* ... */];

async function getLatestTaskState(taskManager: TaskManagerStartContract) {
  const index = callCount++;
  if (index < shouldThrowError.length && shouldThrowError[index]) {
    throw new Error('simulated task manager failure');
  }
  return await taskManager.get(HEALTH_TASK_ID);
}

Also shortened the HEALTH_STATUS_INTERVAL and RETRY_DELAY so it wouldn't take so long to run.

I changed the sequence in shouldThrowError to test out both successful retries and maxing out the retry attempts, and made sure that even after a maxed-out retry returned unavailable, the interval continued polling.

@pmuellr (Member) commented Mar 3, 2021

I was seeing duplicate observable streams being created from getHealthStatusStream (88!), each firing at a 5 minute interval. Perhaps that many concurrent get requests to the task manager saved object were contributing to the 503 socket hangup errors?

Aha!

I was really curious how we were seeing 50x responses, since I thought these requests were auto-retried. With that many streams running concurrently, it makes sense that at least one request could fail on every retry, eventually giving up and returning the 50x as the final reply.

@kibanamachine (Contributor)

💚 Build Succeeded

Metrics: ✅ unchanged

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

@ymao1 ymao1 added the auto-backport label Mar 4, 2021
@ymao1 ymao1 merged commit cad2653 into elastic:master Mar 4, 2021
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Mar 4, 2021
* wip

* Moving catchError so observable stream does not complete. Adding retry on failure

* Using retryWhen. Updating unit tests

* PR fixes

Co-authored-by: Kibana Machine <[email protected]>
@kibanamachine (Contributor)

💚 Backport successful

7.x / #93639

Successful backport PRs will be merged automatically after passing CI.

gmmorris added a commit to gmmorris/kibana that referenced this pull request Mar 4, 2021
* master: (107 commits)
  [Logs UI] Fix log stream data fetching (elastic#93201)
  [App Search] Added relevance tuning search preview (elastic#93054)
  [Canvas] Fix reports embeddables (elastic#93482)
  [ILM] Added new functional test in ILM for creating a new policy (elastic#92936)
  Remove direct dependency on statehood package (elastic#93592)
  [Maps] Track tile loading status (elastic#91585)
  [Discover][Doc] Improve main documentation (elastic#91854)
  [Upgrade Assistant] Disable UA and add prompt (elastic#92834)
  [Snapshot Restore] Remove cloud validation for slm policy (elastic#93609)
  [Maps] Support GeometryCollections in GeoJson upload (elastic#93507)
  [XY Charts] fix partial histogram endzones annotations (elastic#93091)
  [Core] Simplify context typings (elastic#93579)
  [Alerting] Improving health status check (elastic#93282)
  [Discover] Restore context documentation (elastic#90784)
  [core-docs] Edits core-intro section for the new docs system (elastic#93540)
  add missed codeowners (elastic#89714)
  fetch node labels via script execution (elastic#93225)
  [Security Solution] Adds getMockTheme function (elastic#92840)
  Sort dependencies in package.json correctly (elastic#93590)
  [Bug] missing timepicker:quickRanges migration (elastic#93409)
  ...
kibanamachine added a commit that referenced this pull request Mar 4, 2021
* wip

* Moving catchError so observable stream does not complete. Adding retry on failure

* Using retryWhen. Updating unit tests

* PR fixes

Co-authored-by: Kibana Machine <[email protected]>

Co-authored-by: ymao1 <[email protected]>
@ymao1 ymao1 deleted the alerting/health-check branch March 25, 2021 14:42
Successfully merging this pull request may close these issues.

[alerting] sticky "red" status when getting a 50x error calculating alerting health status