[Alerting] retry internal OCC calls within alertsClient #77838

Merged (4 commits) Sep 28, 2020

Conversation

@pmuellr pmuellr (Member) commented Sep 17, 2020

Summary

During development of #75553, some issues came up with the
optimistic concurrency control (OCC) we were using internally
within the alertsClient, via the `version` option/property of the
saved object. The referenced PR updates fields in the alert from
the taskManager task after the alertType executor runs. Some
alertsClient methods invoked via user requests also use OCC to
update the alert, and so in some cases version conflict errors
came up when the alert was updated by task manager in the middle
of one of those methods. Note: the SIEM functional test cases
stress test this REALLY well.

In this PR, we wrap all the methods using OCC with a function that
retries them a small number of times, with a short delay in
between. If the method STILL hits a conflict error after the retry
limit is reached, the error is thrown to the caller. In practice,
this eliminated the version conflict errors that were occurring
with the SIEM tests, once we started updating the saved object in
the executor.
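The retry wrapper can be sketched roughly like this. This is a minimal illustration, not the PR's exact implementation: the helper name `retryIfConflicts` mirrors the code quoted later in the thread, but the constants and the conflict check (a plain `statusCode` test standing in for Kibana's `SavedObjectsErrorHelpers.isConflictError()`) are assumptions.

```typescript
// Hypothetical sketch of the retry-on-conflict wrapper described above.
const RetryForConflictsAttempts = 2; // illustrative; the PR later settled on 1
const RetryForConflictsDelayMs = 250;

// stand-in for SavedObjectsErrorHelpers.isConflictError()
function isConflictError(err: unknown): boolean {
  return (err as { statusCode?: number })?.statusCode === 409;
}

async function retryIfConflicts<T>(
  name: string,
  operation: () => Promise<T>,
  retries: number = RetryForConflictsAttempts
): Promise<T> {
  try {
    return await operation();
  } catch (err) {
    // not a conflict, or out of retries: rethrow to the caller
    if (!isConflictError(err) || retries <= 0) throw err;
    // delay a bit before retrying
    await new Promise((resolve) => setTimeout(resolve, RetryForConflictsDelayMs));
    return retryIfConflicts(name, operation, retries - 1);
  }
}
```

A caller would wrap each OCC-using operation, e.g. `retryIfConflicts('update', () => updateWithOCC(opts))`; non-conflict errors pass straight through without retrying.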

For cases where we know only attributes not contributing to AAD
(the additional authenticated data used when encrypting the alert
saved object) are being updated, a new function is provided that
does a partial update on just those attributes, making partial
updates for those attributes a bit safer. That will also be used
by PR #75553.
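A sketch of that helper's core idea follows. The attribute list matches the one quoted later in the thread; the picking function is illustrative, standing in for the real helper, which would also call the saved-objects client's `update()` with the picked attributes.

```typescript
// Attributes of an alert that do not contribute to AAD, so a partial
// update touching only these cannot invalidate the encrypted fields.
const AlertAttributesExcludedFromAAD = [
  'scheduledTaskId',
  'muteAll',
  'mutedInstanceIds',
  'updatedBy',
] as const;

type PartialUpdateableAlertAttributes = Partial<
  Record<(typeof AlertAttributesExcludedFromAAD)[number], unknown>
>;

// keep only the attributes known to be safe for a partial update;
// anything else (name, params, etc.) is dropped
function pickExcludedFromAAD(
  attributes: Record<string, unknown>
): PartialUpdateableAlertAttributes {
  const picked: Record<string, unknown> = {};
  for (const key of AlertAttributesExcludedFromAAD) {
    if (key in attributes) picked[key] = attributes[key];
  }
  return picked as PartialUpdateableAlertAttributes;
}
```

For example, `pickExcludedFromAAD({ muteAll: true, name: 'x' })` keeps only `muteAll`, so a caller cannot accidentally partially update an AAD-contributing field.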


@mikecote mikecote (Contributor) left a comment

The approach looks good to me. The PR is still draft so some of my review comments may be too early.

I think this will provide more stability to our APIs compared to the previous approach of looking at one version of the data and forcing an update. During conflicts, it can properly manage API keys, tasks, etc., compared to the other approach of not handling those scenarios.

I have a comment about the invalidation of newly created API keys when getting a 409 but that problem already exists. I think a follow up issue / PR is fine for this.

I will finish my review (mostly test files) once the PR is ready for review 👍

@pmuellr pmuellr force-pushed the alerting/partial-update-and-occ-2 branch 4 times, most recently from 8939df7 to f791806 Compare September 22, 2020 03:22
interim commits:

- add jest tests for partially_update_alert
- add jest tests for retry_if_conflicts
- in-progress on jest tests for alertsClient with conflicts
- get tests for alerts_client_conflict_retries running
- add tests for retry logger messages
- resolve some PR comments
- fix alertsClient test to use version with muteAll/unmuteAll
- change test to reflect new use of create api
- fix alertsClient meta updates, add comments
@pmuellr pmuellr force-pushed the alerting/partial-update-and-occ-2 branch from f791806 to d26fa1f Compare September 22, 2020 03:24
@pmuellr pmuellr added Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.10.0 v8.0.0 labels Sep 22, 2020
@pmuellr pmuellr marked this pull request as ready for review September 22, 2020 03:26
@pmuellr pmuellr requested a review from a team as a code owner September 22, 2020 03:26
@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote mikecote (Contributor) left a comment

LGTM!

}

// delay a bit before retrying
logger.warn(`${name} conflict, retrying ...`);
Contributor

nit: I wonder if this will get too noisy at warning level?

Contributor

seconded... could this be debug?

@pmuellr pmuellr (Member Author) Sep 28, 2020

Changed this one to debug, and changed the error() call when the retries are exceeded to a warn. The rationale is that in theory the resulting conflict error will be thrown to the caller, so it's not necessarily an error, but it probably does indicate something is off (perhaps the retries need to be increased), so it would be nice for it to be visible.

@gmmorris gmmorris (Contributor) left a comment

Left a bunch of comments... I understand the why, but I feel like this adds complexity in places that will make it harder to maintain in the long term.

Can we find a way to reduce this?

Comment on lines +81 to +91
function getMockSavedObjectClients(): Record<
  string,
  jest.Mocked<SavedObjectsClientContract | ISavedObjectsRepository>
> {
  return {
    SavedObjectsClientContract: MockSavedObjectsClientContract,
    // doesn't appear to be a mock for this, but it's basically the same as the above,
    // so just cast it to make sure we catch any type errors
    ISavedObjectsRepository: MockISavedObjectsRepository,
  };
}
Contributor

I might be missing something, but why do we need to support both the client and repo? 🤔
Looks like we're only using the client, but supporting both adds complexity in both the tests and the implementation.

Member Author

Ah sorry - this is needed for the alert status PR - I copied the code from there, and it needs the repository because, API-wise, I could only get a repository and not a client (we need a kibana system SO "client" to be used in the alert executor task). We'll see this again when that PR is ready for review; we could look at removing it then.

Member Author

I've changed this to a type which is just the update() method in SavedObjectsClient - which will also work for SavedObjectsRepository once we get there.
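The narrowed type might look something like this. This is a structural sketch only: the real `update()` in Kibana's saved-objects API takes more options and returns a richer type, so the signature here is a simplified stand-in.

```typescript
// Hypothetical narrowing discussed above: only update() is required, so
// both SavedObjectsClientContract and ISavedObjectsRepository satisfy it
// structurally. Signature simplified for illustration.
interface SavedObjectsUpdater {
  update(
    type: string,
    id: string,
    attributes: Record<string, unknown>,
    options?: { version?: string }
  ): Promise<{ id: string; attributes: Record<string, unknown> }>;
}

// any object with a matching update() can be passed in, client or repository
async function touchAlert(client: SavedObjectsUpdater, id: string) {
  return client.update('alert', id, { updatedBy: 'system' });
}
```

Because TypeScript checks structurally, tests can then mock a plain object with a single `update` method instead of mocking the whole client or repository.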

Comment on lines +14 to +19
export const AlertAttributesExcludedFromAAD = [
  'scheduledTaskId',
  'muteAll',
  'mutedInstanceIds',
  'updatedBy',
];
Contributor

I don't think this is doing what you had in mind, the type signature looks wrong.

I suspect you meant this though:

Suggested change (add `as const`):

export const AlertAttributesExcludedFromAAD = [
  'scheduledTaskId',
  'muteAll',
  'mutedInstanceIds',
  'updatedBy',
] as const;

Member Author

Huh. Any chance this is a case of vscode being confused vs an actual typescript error? I'll take another look, though, to see if it's actually allowing other attrs in.

I copy-pasta'd this from an example I saw about a week ago - if it's too obscure anyway (I'd say it's fairly obscure), I can switch this back to the explicit version where we list the attr names twice - it's not too bad with just 4 attrs.

Member Author

we could use enums here, or change to use the | syntax

Member Author

I changed from the confusing

export type AlertAttributesExcludedFromAADType = typeof AlertAttributesExcludedFromAAD[number];

to the more verbose, but clearer:

export type AlertAttributesExcludedFromAADType =
  | 'scheduledTaskId'
  | 'muteAll'
  | 'mutedInstanceIds'
  | 'updatedBy';

Both the attribute names as an array of strings, and this type definition, are right next to each other in the code, so hopefully it will be simple to keep these in sync if they need to be changed in the future.
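The two forms discussed above are equivalent when the array is declared `as const`; a small sketch of both (illustrative only, with a shortened array name to avoid confusion with the quoted code):

```typescript
// With `as const`, the element union type can be derived from the array...
const AttrsExcludedFromAAD = [
  'scheduledTaskId',
  'muteAll',
  'mutedInstanceIds',
  'updatedBy',
] as const;
type DerivedType = (typeof AttrsExcludedFromAAD)[number];

// ...which is the same union as the explicit, more verbose form:
type ExplicitType =
  | 'scheduledTaskId'
  | 'muteAll'
  | 'mutedInstanceIds'
  | 'updatedBy';

// compile-time check that the two are interchangeable: each assignment
// only type-checks if the unions are mutually assignable
const check1: DerivedType = 'muteAll' as ExplicitType;
const check2: ExplicitType = 'updatedBy' as DerivedType;
```

The `typeof ...[number]` form can never drift out of sync with the array; the explicit union is easier to read but, as noted above, relies on keeping the two declarations adjacent.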

Comment on lines 45 to 53

// must be a conflict; if no retries left, throw it
if (retries <= 0) {
  logger.error(`${name} conflict, exceeded retries`);
  throw error;
}

// delay a bit before retrying
logger.warn(`${name} conflict, retrying ...`);
Contributor

nit: I find the flow confusing here - seeing as we return in the try, could we keep everything after it in the catch? That will allow us to avoid the let error in the upper scope, and would (subjectively) read a little better to me.

type RetryableForConflicts<T> = () => Promise<T>;

// number of times to retry when conflicts occur
export const RetryForConflictsAttempts = 5;
Contributor

This feels high 🤔

I get why we're allowing it a second attempt, but 5 times feels like we might be addressing a symptom instead of the root cause. Any thoughts?

Member Author

Ya, I was just thinking today that it seems likely you'd never have to retry more than once. And 5 does seem pretty high. If you can't update within 5 retries, something is probably very wrong with your system :-).

OTOH, if you do get to this horrible place, these APIs could provide some "breathing room" for the system to fix itself - that's a bit of a stretch, I'd say.

What kind of number are you thinking? I could see going as low as 2; I think 3 feels a little better to me.

Member Author

After conversation with Gidi, changed the retries to 1 - we'll find out from the SIEM tests once we add the alert execution status update, if this is good enough.

Comment on lines 58 to 61
async function waitBeforeNextRetry(): Promise<void> {
  const millis = random(RetryForConflictsDelayMin, RetryForConflictsDelayMax);
  await new Promise((resolve) => setTimeout(resolve, millis));
}
Contributor

I might be missing something... why have we introduced randomness into this? 🤨

Member Author

Mike suggested it: in case we get a stampede somehow, it would be nice to spread the retries out a bit, otherwise we'll get 5 stampedes :-)

Member Author

worth a comment, will add

Member Author

per discussion w/Gidi, going to change this to 1 retry, we can ratchet up later if needed

Member Author

After conversation with Gidi, with a smaller number of retries, adding randomness doesn't really help anything. So changed this to a static 250ms, but left a comment that we did consider randomness to help prevent stampedes.
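After that change, the delay helper reduces to something like the following sketch. The 250ms value is the one mentioned above and the function name comes from the quoted code; the constant name is an assumption.

```typescript
// Static delay between conflict retries. Randomness (jitter) was
// considered to help prevent retry stampedes, but with a single retry
// it adds little, so a fixed value is used instead.
const RetryForConflictsDelay = 250; // milliseconds

async function waitBeforeNextRetry(): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, RetryForConflictsDelay));
}
```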

);
}

private async updateWithOCC({ id, data }: UpdateOptions): Promise<PartialAlert> {
Contributor

Seeing as AlertsClient is already a huge file, is there a way we can add the retry into this without adding an extra method for each operation? 🤔
Perhaps by using some kind of decorator that wraps these methods or something along those lines?
This will also become quite a bit more complicated when we introduce optional version to these methods, if we can find a way to reduce this into a single extension, it will help with maintenance down the line.

Member Author

Do we support decorators? I'm not sure inlining these is going to save all that much space, is it? My thinking was that once we add versions to these methods, that basically just means setting retries to 0, so it will likely be a straightforward change at each of the sites.

The reason I separated these out was to try to keep the actual methods a little cleaner - wrapping like this obviously doesn't scale real well beyond the first time you do it (now!). I can go either way on inlining, but I think the current shape is a little more readable than inlined.

Member Author

After conversation with Gidi, we decided to think about some alternatives to this approach, but couldn't come up with anything that wouldn't require even more changes to the alerts client.
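The per-method wrapping pattern that was kept can be sketched like this. The shape is hypothetical: `muteAll` is one of the real alertsClient methods, but the bodies here are stubs, and the inline retry helper is a minimal stand-in so the sketch is self-contained.

```typescript
// Minimal stand-in for the retry helper (single retry, per the discussion).
async function retryOnceIfConflict<T>(name: string, fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    if ((err as { statusCode?: number })?.statusCode !== 409) throw err;
    return await fn(); // one retry on conflict, then let any error propagate
  }
}

class AlertsClientSketch {
  public updateCalls = 0;

  // public method: just delegates through the retry wrapper
  public async muteAll(id: string): Promise<void> {
    return retryOnceIfConflict('muteAll', () => this.muteAllWithOCC(id));
  }

  // private *WithOCC method: would do the actual versioned saved-object
  // update; stubbed here to fail once with a conflict, then succeed
  private async muteAllWithOCC(_id: string): Promise<void> {
    this.updateCalls++;
    if (this.updateCalls === 1) {
      const e = new Error('conflict') as Error & { statusCode?: number };
      e.statusCode = 409;
      throw e;
    }
  }
}
```

Each OCC-using public method gets a matching private `*WithOCC` counterpart, which is the per-method duplication the review comments above are weighing against a decorator-style alternative.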

@gmmorris gmmorris (Contributor) left a comment

Looking good 👍
Thanks for the changes :D

@pmuellr pmuellr (Member Author) commented Sep 28, 2020

@elasticmachine merge upstream

@pmuellr pmuellr (Member Author) commented Sep 28, 2020

@elasticmachine merge upstream

@kibanamachine (Contributor)

💚 Build Succeeded

Metrics [docs]

distributable file count

id       value   diff   baseline
default  45673   +2     45671

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@pmuellr pmuellr merged commit feab3e3 into elastic:master Sep 28, 2020
pmuellr added a commit to pmuellr/kibana that referenced this pull request Sep 28, 2020
pmuellr added a commit that referenced this pull request Sep 28, 2020
phillipb added a commit to phillipb/kibana that referenced this pull request Sep 29, 2020
…a into add-anomalies-to-timeline

* 'add-anomalies-to-timeline' of github.com:phillipb/kibana: (89 commits)
  [Alerting] retry internal OCC calls within alertsClient (elastic#77838)
  ...
gmmorris added a commit to gmmorris/kibana that referenced this pull request Sep 29, 2020
* master: (365 commits)
  [Alerting] retry internal OCC calls within alertsClient (elastic#77838)
  ...
Labels
backported Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.10.0 v8.0.0
5 participants