[Alerting] retry internal OCC calls within alertsClient #77838
Conversation
The approach looks good to me. The PR is still a draft, so some of my review comments may be premature.
I think this will provide more stability to our APIs compared to the previous approach of looking at one version of the data and forcing an update. During conflicts, it can properly manage API keys, tasks, etc., which the other approach did not handle.
I have a comment about the invalidation of newly created API keys when getting a 409, but that problem already exists. I think a follow-up issue / PR is fine for this.
I will finish my review (mostly test files) once the PR is ready for review 👍
Force-pushed from 8939df7 to f791806
During development of elastic#75553, some issues came up with the optimistic concurrency control (OCC) we were using internally within the alertsClient, via the `version` option/property of the saved object. The referenced PR updates new fields in the alert from the taskManager task after the alertType executor runs. Several alertsClient methods, invoked by user requests, use OCC to update the alert, so in some cases version conflict errors came up when task manager updated the alert in the middle of one of these methods. Note: the SIEM functional test cases stress test this REALLY well. In this PR, we wrap all the methods using OCC with a function that retries them a small number of times, with a short delay in between. If the original method STILL gets a conflict error after the retry limit, that error is thrown. In practice, this eliminated the version conflicts that were occurring in the SIEM tests once we started updating the saved object in the executor. For cases where we know only attributes not contributing to AAD are being updated, a new function is provided that does a partial update on just those attributes, making partial updates for those attributes a bit safer. That will also be used by PR elastic#75553.

interim commits:
- add jest tests for partially_update_alert
- add jest tests for retry_if_conflicts
- in-progress on jest tests for alertsClient with conflicts
- get tests for alerts_client_conflict_retries running
- add tests for retry logger messages
- resolve some PR comments
- fix alertsClient test to use version with muteAll/unmuteAll
- change test to reflect new use of create api
- fix alertsClient meta updates, add comments
Force-pushed from f791806 to d26fa1f
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
LGTM!
}

// delay a bit before retrying
logger.warn(`${name} conflict, retrying ...`);
nit: I wonder if this will get too noisy at warning level?
seconded... could this be `debug`?
changed this one to a `debug`; and changed the `error()` call when the retries are exceeded to a `warn`. Rationale: in theory the resulting conflict error will be thrown to the caller, so it's not necessarily an error. But it probably does indicate something is off - perhaps needing to increase the retries - so it would be nice for it to be visible.
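To make the resulting levels concrete, here is a tiny illustrative sketch (the logConflict helper and the logger shape are made up for the example; the real code just calls the logger inline at the two sites):

interface RetryLogger {
  debug(message: string): void;
  warn(message: string): void;
}

// illustrative only: the retry path logs at debug, the retries-exhausted path at warn
function logConflict(logger: RetryLogger, name: string, retriesExhausted: boolean): void {
  if (retriesExhausted) {
    // the conflict error is re-thrown to the caller anyway, so warn rather than error
    logger.warn(`${name} conflict, exceeded retries`);
  } else {
    // a retry is about to happen; that is expected behaviour, so keep it at debug
    logger.debug(`${name} conflict, retrying ...`);
  }
}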
Left a bunch of comments... I understand the why, but I feel like this adds complexity in places that will make it harder to maintain in the long term.
Can we find a way to reduce this?
function getMockSavedObjectClients(): Record<
  string,
  jest.Mocked<SavedObjectsClientContract | ISavedObjectsRepository>
> {
  return {
    SavedObjectsClientContract: MockSavedObjectsClientContract,
    // doesn't appear to be a mock for this, but it's basically the same as the above,
    // so just cast it to make sure we catch any type errors
    ISavedObjectsRepository: MockISavedObjectsRepository,
  };
}
I might be missing something, but why do we need to support both the client and repo? 🤔
Looks like we're only using the client, but supporting both adds complexity in both the tests and the implementation.
Ah sorry - this is needed for the alert status PR - I copied the code from there, and it needs the repository, because ... well, not quite sure, but API-wise I could only get a repository and not a client (we need a kibana system SO "client" to be used in the alert executor task). We'll see this again when that PR is ready for review; we could look at removing it then.
I've changed this to a type which is just the `update()` method in SavedObjectsClient - which will also work for SavedObjectsRepository once we get there.
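For reference, a sketch of what such a narrowed type could look like (assuming the standard SavedObjectsClientContract from Kibana core; the alias name and import path here are illustrative):

import type { SavedObjectsClientContract } from 'src/core/server';

// only update() is needed, so accept anything that provides it; both the
// saved objects client and the saved objects repository satisfy this shape
export type SavedObjectClientForUpdate = Pick<SavedObjectsClientContract, 'update'>;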
export const AlertAttributesExcludedFromAAD = [
  'scheduledTaskId',
  'muteAll',
  'mutedInstanceIds',
  'updatedBy',
];
I don't think this is doing what you had in mind, the type signature looks wrong.
I suspect you meant this though:
Suggested change:
export const AlertAttributesExcludedFromAAD = [
  'scheduledTaskId',
  'muteAll',
  'mutedInstanceIds',
  'updatedBy',
] as const;
Huh. Any chance this is a case of vscode being confused vs a typescript error? I'll take another look, though, and see if it's actually allowing other attrs in.
I copy pasta'd this from an example I had seen about a week ago - if it's too obscure anyway (I'd say it's fairly obscure), I can switch this back to the explicit version where we list the attr names twice - it's not too bad with just 4 attrs.
we could use enums here, or change to use the `|` syntax
I changed from the confusing
export type AlertAttributesExcludedFromAADType = typeof AlertAttributesExcludedFromAAD[number];
to the more verbose, but clearer:
export type AlertAttributesExcludedFromAADType =
| 'scheduledTaskId'
| 'muteAll'
| 'mutedInstanceIds'
| 'updatedBy';
Both the attribute names as an array of strings, and this type definition, are right next to each other in the code, so hopefully it will be simple to keep these in sync if they need to be changed in the future.
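For anyone reading along, a small standalone sketch of the two options (plain TypeScript, nothing Kibana-specific): the `as const` assertion is what keeps the array elements as string literal types, so `typeof ...[number]` yields the union rather than just `string`.

// derived form: requires `as const`, otherwise the element type widens to string
export const AlertAttributesExcludedFromAAD = [
  'scheduledTaskId',
  'muteAll',
  'mutedInstanceIds',
  'updatedBy',
] as const;
export type DerivedAlertAttributesExcludedFromAADType =
  typeof AlertAttributesExcludedFromAAD[number];
// = 'scheduledTaskId' | 'muteAll' | 'mutedInstanceIds' | 'updatedBy'

// explicit form: the union is spelled out by hand, next to the array, and kept in sync manually
export type AlertAttributesExcludedFromAADType =
  | 'scheduledTaskId'
  | 'muteAll'
  | 'mutedInstanceIds'
  | 'updatedBy';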
// must be a conflict; if no retries left, throw it
if (retries <= 0) {
  logger.error(`${name} conflict, exceeded retries`);
  throw error;
}

// delay a bit before retrying
logger.warn(`${name} conflict, retrying ...`);
nit: I find the flow confusing here - seeing as we return in the `try`, could we keep everything after it in the `catch`? That will allow us to avoid the `let error` in the upper scope, and would (subjectively) read a little better to me.
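A rough sketch of the shape this suggests - the success path returns from the `try`, and everything conflict-related lives in the `catch`, so no `let error` is needed in the outer scope (the conflict check and delay helpers below are assumptions for the sketch, not the actual Kibana helpers):

// assumptions for the sketch: a 409-style conflict check and a fixed delay
function isConflictError(error: unknown): boolean {
  return (error as { statusCode?: number })?.statusCode === 409;
}

async function waitBeforeNextRetry(): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 250));
}

interface RetryLogger {
  debug(message: string): void;
  warn(message: string): void;
}

export async function retryIfConflicts<T>(
  logger: RetryLogger,
  name: string,
  operation: () => Promise<T>,
  retries: number = 1
): Promise<T> {
  try {
    return await operation();
  } catch (error) {
    // anything other than a conflict is not ours to handle
    if (!isConflictError(error)) {
      throw error;
    }
    // must be a conflict; if no retries left, throw it
    if (retries <= 0) {
      logger.warn(`${name} conflict, exceeded retries`);
      throw error;
    }
    // delay a bit before retrying
    logger.debug(`${name} conflict, retrying ...`);
    await waitBeforeNextRetry();
    return retryIfConflicts(logger, name, operation, retries - 1);
  }
}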
type RetryableForConflicts<T> = () => Promise<T>;

// number of times to retry when conflicts occur
export const RetryForConflictsAttempts = 5;
This feels high 🤔
I get why we're allowing it a second attempt, but 5 times feels like we might be addressing a symptom instead of the root cause. Any thoughts?
Ya, I was just thinking today, seems likely you'd never have to retry more than once. And 5 does seem pretty high. If you can't update within 5 retries, probably something is very wrong with your system :-).
OTOH, if you do get to this horrible place, these APIs could provide some "breathing room" for the system to fix itself - that's a bit of a stretch I'd say.
What kinda number you thinking? I could see going as low as 2, I think 3 feels a little better to me.
After conversation with Gidi, changed the retries to 1 - we'll find out from the SIEM tests, once we add the alert execution status update, if this is good enough.
async function waitBeforeNextRetry(): Promise<void> {
  const millis = random(RetryForConflictsDelayMin, RetryForConflictsDelayMax);
  await new Promise((resolve) => setTimeout(resolve, millis));
}
I might be missing something... why have we introduced randomness into this? 🤨
Mike suggested it: in case we get a stampede somehow, it would be nice to spread the retries out a bit; otherwise we'll get 5 stampedes :-)
worth a comment, will add
per discussion w/Gidi, going to change this to 1 retry, we can ratchet up later if needed
After conversation with Gidi, with a smaller number of retries, adding randomness doesn't really help anything. So changed this to a static 250ms, but left a comment that we did consider randomness to help prevent stampedes.
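So the end state presumably looks something like this sketch (names approximate; the 250ms value is the one mentioned above):

// we considered a randomized delay here to spread retries out and avoid
// stampedes, but with only a single retry a fixed delay is simpler
const RetryForConflictsDelay = 250;

async function waitBeforeNextRetry(): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, RetryForConflictsDelay));
}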
);
}

private async updateWithOCC({ id, data }: UpdateOptions): Promise<PartialAlert> {
Seeing as AlertsClient is already a huge file, is there a way we can add the retry into this without adding an extra method for each operation? 🤔
Perhaps by using some kind of decorator that wraps these methods or something along those lines?
This will also become quite a bit more complicated when we introduce optional `version` to these methods; if we can find a way to reduce this into a single extension, it will help with maintenance down the line.
Do we support decorators? I'm not sure inlining these is going to save all that much space, is it? My thinking was that once we add versions to these methods, that basically just means setting retries to 0, so it will likely be a straightforward change to each of the sites.
The reason I separated these out was to try to keep the actual methods a little cleaner - wrapping like this obviously doesn't scale real well beyond the first time you do it (now!). I can go either way on inlining, but I think the current shape is a little more readable than inlined.
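To illustrate the current shape being discussed, here is a self-contained sketch of the wrap-per-method pattern (not the real AlertsClient; the simplified retry helper below blindly retries once on any error, whereas the real one only retries on conflicts):

type MinimalLogger = { debug(msg: string): void; warn(msg: string): void };

// stand-in for the real retry helper; see the fuller sketch earlier in this thread
async function retryIfConflicts<T>(
  logger: MinimalLogger,
  name: string,
  operation: () => Promise<T>
): Promise<T> {
  try {
    return await operation();
  } catch (error) {
    logger.debug(`${name} conflict, retrying ...`);
    return await operation();
  }
}

interface UpdateOptions {
  id: string;
  data: Record<string, unknown>;
}

class AlertsClientSketch {
  constructor(private readonly logger: MinimalLogger) {}

  // thin public wrapper: all it adds is routing the call through the retry helper
  public async update(options: UpdateOptions): Promise<unknown> {
    return await retryIfConflicts(
      this.logger,
      `alertsClient.update('${options.id}')`,
      async () => await this.updateWithOCC(options)
    );
  }

  // the original method body, using OCC via the saved object version, moves here unchanged
  private async updateWithOCC({ id, data }: UpdateOptions): Promise<unknown> {
    return { id, ...data };
  }
}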
After conversation with Gidi, we decided to think about some alternatives to this approach, but couldn't come up with anything that wouldn't require even more changes to the alerts client.
Looking good 👍
Thanks for the changes :D
@elasticmachine merge upstream
💚 Build Succeeded
Summary
During development of #75553, some issues came up with the optimistic concurrency control (OCC) we were using internally within the alertsClient, via the `version` option/property of the saved object. The referenced PR updates new fields in the alert from the taskManager task after the alertType executor runs. Several alertsClient methods, invoked by user requests, use OCC to update the alert, so in some cases version conflict errors came up when task manager updated the alert in the middle of one of these methods. Note: the SIEM functional test cases stress test this REALLY well.

In this PR, we wrap all the methods using OCC with a function that retries them a small number of times, with a short delay in between. If the original method STILL gets a conflict error after the retry limit, that error is thrown. In practice, this eliminated the version conflicts that were occurring in the SIEM tests once we started updating the saved object in the executor.

For cases where we know only attributes not contributing to AAD (additional authenticated data) are being updated, a new function is provided that does a partial update on just those attributes, making partial updates for those attributes a bit safer. That will also be used by PR #75553.
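As a sketch of that second piece, a partial update restricted to the excluded-from-AAD attributes could look roughly like this (the attribute list comes from this PR; the real partially_update_alert handles types and options more carefully, and pick here is lodash):

import { pick } from 'lodash';
import type { SavedObjectsClientContract, SavedObjectsUpdateOptions } from 'src/core/server';

export const AlertAttributesExcludedFromAAD = [
  'scheduledTaskId',
  'muteAll',
  'mutedInstanceIds',
  'updatedBy',
] as const;

type ExcludedFromAADAttribute = typeof AlertAttributesExcludedFromAAD[number];
type PartiallyUpdateableAlertAttributes = Partial<Record<ExcludedFromAADAttribute, unknown>>;

// updates only attributes that do not contribute to AAD, so the partial write
// cannot invalidate the alert's encrypted attributes
export async function partiallyUpdateAlert(
  savedObjectsClient: Pick<SavedObjectsClientContract, 'update'>,
  id: string,
  attributes: PartiallyUpdateableAlertAttributes,
  options: SavedObjectsUpdateOptions = {}
): Promise<void> {
  // drop anything outside the allowed attribute list before writing
  const attributeUpdates = pick(attributes, AlertAttributesExcludedFromAAD);
  await savedObjectsClient.update('alert', id, attributeUpdates, options);
}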