[Alerting] retry internal OCC calls within alertsClient #77838
Conversation
The approach looks good to me. The PR is still a draft, so some of my review comments may be premature.
I think this will provide more stability to our APIs compared to the previous approach of looking at one version of the data and forcing an update. During conflicts, it can properly manage API keys, tasks, etc., which the other approach did not handle.
I have a comment about the invalidation of newly created API keys when getting a 409, but that problem already exists. I think a follow-up issue / PR is fine for this.
I will finish my review (mostly test files) once the PR is ready for review 👍
Force-pushed from 8939df7 to f791806
During development of elastic#75553, some issues came up with the optimistic concurrency control (OCC) we were using internally within the alertsClient, via the `version` option/property of the saved object. The referenced PR updates new fields in the alert from the taskManager task after the alertType executor runs. Several alertsClient methods, invoked by user requests, use OCC to update the alert, so in some cases version conflict errors came up when task manager updated the alert in the middle of one of these methods. Note: the SIEM functional test cases stress test this REALLY well. In this PR, we wrap all the methods using OCC with a function that retries them a small number of times, with a short delay in between. If the original method STILL gets a conflict error after the retry limit, that error is thrown. In practice, this eliminated the version conflicts that were occurring in the SIEM tests once we started updating the saved object in the executor. For cases where we know only attributes not contributing to AAD are being updated, a new function is provided that does a partial update on just those attributes, making partial updates for those attributes a bit safer. That will also be used by PR elastic#75553.

interim commits:
- add jest tests for partially_update_alert
- add jest tests for retry_if_conflicts
- in-progress on jest tests for alertsClient with conflicts
- get tests for alerts_client_conflict_retries running
- add tests for retry logger messages
- resolve some PR comments
- fix alertsClient test to use version with muteAll/unmuteAll
- change test to reflect new use of create api
- fix alertsClient meta updates, add comments
Force-pushed from f791806 to d26fa1f
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
LGTM!
}

// delay a bit before retrying
logger.warn(`${name} conflict, retrying ...`);
nit: I wonder if this will get too noisy at warning level?
seconded... could this be `debug`?
changed this one to a `debug`; and changed the `error()` call when the retries are exceeded to a `warn`. Rationale: in theory the resulting conflict error will be thrown to the caller, so it's not necessarily an error. But it probably does indicate something is off - perhaps needing to increase the retries - so it would be nice for it to be visible.
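To make the resulting levels concrete, here is a tiny illustrative sketch (the logConflict helper and the logger shape are made up for the example; the real code just calls the logger inline at the two sites):

interface RetryLogger {
  debug(message: string): void;
  warn(message: string): void;
}

// illustrative only: the retry path logs at debug, the retries-exhausted path at warn
function logConflict(logger: RetryLogger, name: string, retriesExhausted: boolean): void {
  if (retriesExhausted) {
    // the conflict error is re-thrown to the caller anyway, so warn rather than error
    logger.warn(`${name} conflict, exceeded retries`);
  } else {
    // a retry is about to happen; that is expected behaviour, so keep it at debug
    logger.debug(`${name} conflict, retrying ...`);
  }
}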
Left a bunch of comments... I understand the why, but I feel like this adds complexity in places that will make it harder to maintain in the long term.
Can we find a way to reduce this?
function getMockSavedObjectClients(): Record<
  string,
  jest.Mocked<SavedObjectsClientContract | ISavedObjectsRepository>
> {
  return {
    SavedObjectsClientContract: MockSavedObjectsClientContract,
    // doesn't appear to be a mock for this, but it's basically the same as the above,
    // so just cast it to make sure we catch any type errors
    ISavedObjectsRepository: MockISavedObjectsRepository,
  };
}
I might be missing something, but why do we need to support both the client and repo? 🤔
Looks like we're only using the client, but supporting both adds complexity in both the tests and the implementation.
Ah sorry - this is needed for the alert status PR - I copied the code from there, and it needs the repository, because ... well, not quite sure, but API-wise I could only get a repository and not a client (we need a kibana system SO "client" to be used in the alert executor task). We'll see this again when that PR is ready for review; we could look at removing it then.
I've changed this to a type which is just the `update()` method in SavedObjectsClient - which will also work for SavedObjectsRepository once we get there.
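For reference, a sketch of what such a narrowed type could look like (assuming the standard SavedObjectsClientContract from Kibana core; the alias name and import path here are illustrative):

import type { SavedObjectsClientContract } from 'src/core/server';

// only update() is needed, so accept anything that provides it; both the
// saved objects client and the saved objects repository satisfy this shape
export type SavedObjectClientForUpdate = Pick<SavedObjectsClientContract, 'update'>;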
export const AlertAttributesExcludedFromAAD = [
  'scheduledTaskId',
  'muteAll',
  'mutedInstanceIds',
  'updatedBy',
];
I don't think this is doing what you had in mind, the type signature looks wrong.
I suspect you meant this though:
Suggested change:
export const AlertAttributesExcludedFromAAD = [
  'scheduledTaskId',
  'muteAll',
  'mutedInstanceIds',
  'updatedBy',
] as const;
Huh. Any chance this is a case of vscode being confused vs a typescript error? I'll take another look, though, and see if it's actually allowing other attrs in.
I copy pasta'd this from an example I had seen about a week ago - if it's too obscure anyway (I'd say it's fairly obscure), I can switch this back to the explicit version where we list the attr names twice - it's not too bad with just 4 attrs.
we could use enums here, or change to use the `|` syntax
I changed from the confusing
export type AlertAttributesExcludedFromAADType = typeof AlertAttributesExcludedFromAAD[number];
to the more verbose, but clearer:
export type AlertAttributesExcludedFromAADType =
| 'scheduledTaskId'
| 'muteAll'
| 'mutedInstanceIds'
| 'updatedBy';
Both the attribute names as an array of strings, and this type definition, are right next to each other in the code, so hopefully it will be simple to keep these in sync if they need to be changed in the future.
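For anyone reading along, a small standalone sketch of the two options (plain TypeScript, nothing Kibana-specific): the `as const` assertion is what keeps the array elements as string literal types, so `typeof ...[number]` yields the union rather than just `string`.

// derived form: requires `as const`, otherwise the element type widens to string
export const AlertAttributesExcludedFromAAD = [
  'scheduledTaskId',
  'muteAll',
  'mutedInstanceIds',
  'updatedBy',
] as const;
export type DerivedAlertAttributesExcludedFromAADType =
  typeof AlertAttributesExcludedFromAAD[number];
// = 'scheduledTaskId' | 'muteAll' | 'mutedInstanceIds' | 'updatedBy'

// explicit form: the union is spelled out by hand, next to the array, and kept in sync manually
export type AlertAttributesExcludedFromAADType =
  | 'scheduledTaskId'
  | 'muteAll'
  | 'mutedInstanceIds'
  | 'updatedBy';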
// must be a conflict; if no retries left, throw it
if (retries <= 0) {
  logger.error(`${name} conflict, exceeded retries`);
  throw error;
}

// delay a bit before retrying
logger.warn(`${name} conflict, retrying ...`);
nit: I find the flow confusing here - seeing as we return in the `try`, could we keep everything after it in the `catch`? That will allow us to avoid the `let error` in the upper scope, and would (subjectively) read a little better to me.
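A rough sketch of the shape this suggests - the success path returns from the `try`, and everything conflict-related lives in the `catch`, so no `let error` is needed in the outer scope (the conflict check and delay helpers below are assumptions for the sketch, not the actual Kibana helpers):

// assumptions for the sketch: a 409-style conflict check and a fixed delay
function isConflictError(error: unknown): boolean {
  return (error as { statusCode?: number })?.statusCode === 409;
}

async function waitBeforeNextRetry(): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 250));
}

interface RetryLogger {
  debug(message: string): void;
  warn(message: string): void;
}

export async function retryIfConflicts<T>(
  logger: RetryLogger,
  name: string,
  operation: () => Promise<T>,
  retries: number = 1
): Promise<T> {
  try {
    return await operation();
  } catch (error) {
    // anything other than a conflict is not ours to handle
    if (!isConflictError(error)) {
      throw error;
    }
    // must be a conflict; if no retries left, throw it
    if (retries <= 0) {
      logger.warn(`${name} conflict, exceeded retries`);
      throw error;
    }
    // delay a bit before retrying
    logger.debug(`${name} conflict, retrying ...`);
    await waitBeforeNextRetry();
    return retryIfConflicts(logger, name, operation, retries - 1);
  }
}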
type RetryableForConflicts<T> = () => Promise<T>;

// number of times to retry when conflicts occur
export const RetryForConflictsAttempts = 5;
This feels high 🤔
I get why we're allowing it a second attempt, but 5 times feels like we might be addressing a symptom instead of the root cause. Any thoughts?
Ya, I was just thinking today, seems likely you'd never have to retry more than once. And 5 does seem pretty high. If you can't update within 5 retries, probably something is very wrong with your system :-).
OTOH, if you do get to this horrible place, these APIs could provide some "breathing room" for the system to fix itself - that's a bit of a stretch I'd say.
What kinda number you thinking? I could see going as low as 2, I think 3 feels a little better to me.
After conversation with Gidi, changed the retries to 1 - we'll find out from the SIEM tests, once we add the alert execution status update, if this is good enough.
async function waitBeforeNextRetry(): Promise<void> {
  const millis = random(RetryForConflictsDelayMin, RetryForConflictsDelayMax);
  await new Promise((resolve) => setTimeout(resolve, millis));
}
I might be missing something... why have we introduced randomness into this? 🤨
Mike suggested it: in case we get a stampede somehow, it would be nice to spread the retries out a bit; otherwise we'll get 5 stampedes :-)
worth a comment, will add
per discussion w/Gidi, going to change this to 1 retry, we can ratchet up later if needed
After conversation with Gidi, with a smaller number of retries, adding randomness doesn't really help anything. So changed this to a static 250ms, but left a comment that we did consider randomness to help prevent stampedes.
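So the end state presumably looks something like this sketch (names approximate; the 250ms value is the one mentioned above):

// we considered a randomized delay here to spread retries out and avoid
// stampedes, but with only a single retry a fixed delay is simpler
const RetryForConflictsDelay = 250;

async function waitBeforeNextRetry(): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, RetryForConflictsDelay));
}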
);
}

private async updateWithOCC({ id, data }: UpdateOptions): Promise<PartialAlert> {
Seeing as AlertsClient is already a huge file, is there a way we can add the retry into this without adding an extra method for each operation? 🤔
Perhaps by using some kind of decorator that wraps these methods or something along those lines?
This will also become quite a bit more complicated when we introduce optional `version` to these methods; if we can find a way to reduce this into a single extension, it will help with maintenance down the line.
Do we support decorators? I'm not sure inlining these is going to save all that much space, is it? My thinking was that once we add versions to these methods, that basically just means setting retries to 0, so it will likely be a straightforward change to each of the sites.
The reason I separated these out was to try to keep the actual methods a little cleaner - wrapping like this obviously doesn't scale real well beyond the first time you do it (now!). I can go either way on inlining, but I think the current shape is a little more readable than inlined.
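To illustrate the current shape being discussed, here is a self-contained sketch of the wrap-per-method pattern (not the real AlertsClient; the simplified retry helper below blindly retries once on any error, whereas the real one only retries on conflicts):

type MinimalLogger = { debug(msg: string): void; warn(msg: string): void };

// stand-in for the real retry helper; see the fuller sketch earlier in this thread
async function retryIfConflicts<T>(
  logger: MinimalLogger,
  name: string,
  operation: () => Promise<T>
): Promise<T> {
  try {
    return await operation();
  } catch (error) {
    logger.debug(`${name} conflict, retrying ...`);
    return await operation();
  }
}

interface UpdateOptions {
  id: string;
  data: Record<string, unknown>;
}

class AlertsClientSketch {
  constructor(private readonly logger: MinimalLogger) {}

  // thin public wrapper: all it adds is routing the call through the retry helper
  public async update(options: UpdateOptions): Promise<unknown> {
    return await retryIfConflicts(
      this.logger,
      `alertsClient.update('${options.id}')`,
      async () => await this.updateWithOCC(options)
    );
  }

  // the original method body, using OCC via the saved object version, moves here unchanged
  private async updateWithOCC({ id, data }: UpdateOptions): Promise<unknown> {
    return { id, ...data };
  }
}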
After conversation with Gidi, we decided to think about some alternatives to this approach, but couldn't come up with anything that wouldn't require even more changes to the alerts client.
Looking good 👍
Thanks for the changes :D
@elasticmachine merge upstream
💚 Build Succeeded
Summary
During development of #75553, some issues came up with the optimistic concurrency control (OCC) we were using internally within the alertsClient, via the `version` option/property of the saved object. The referenced PR updates new fields in the alert from the taskManager task after the alertType executor runs. Several alertsClient methods, invoked by user requests, use OCC to update the alert, so in some cases version conflict errors came up when task manager updated the alert in the middle of one of these methods. Note: the SIEM functional test cases stress test this REALLY well.

In this PR, we wrap all the methods using OCC with a function that retries them a small number of times, with a short delay in between. If the original method STILL gets a conflict error after the retry limit, that error is thrown. In practice, this eliminated the version conflicts that were occurring in the SIEM tests once we started updating the saved object in the executor.

For cases where we know only attributes not contributing to AAD (additional authenticated data) are being updated, a new function is provided that does a partial update on just those attributes, making partial updates for those attributes a bit safer. That will also be used by PR #75553.
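As a sketch of that second piece, a partial update restricted to the excluded-from-AAD attributes could look roughly like this (the attribute list comes from this PR; the real partially_update_alert handles types and options more carefully, and pick here is lodash):

import { pick } from 'lodash';
import type { SavedObjectsClientContract, SavedObjectsUpdateOptions } from 'src/core/server';

export const AlertAttributesExcludedFromAAD = [
  'scheduledTaskId',
  'muteAll',
  'mutedInstanceIds',
  'updatedBy',
] as const;

type ExcludedFromAADAttribute = typeof AlertAttributesExcludedFromAAD[number];
type PartiallyUpdateableAlertAttributes = Partial<Record<ExcludedFromAADAttribute, unknown>>;

// updates only attributes that do not contribute to AAD, so the partial write
// cannot invalidate the alert's encrypted attributes
export async function partiallyUpdateAlert(
  savedObjectsClient: Pick<SavedObjectsClientContract, 'update'>,
  id: string,
  attributes: PartiallyUpdateableAlertAttributes,
  options: SavedObjectsUpdateOptions = {}
): Promise<void> {
  // drop anything outside the allowed attribute list before writing
  const attributeUpdates = pick(attributes, AlertAttributesExcludedFromAAD);
  await savedObjectsClient.update('alert', id, attributeUpdates, options);
}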