
[ResponseOps] results from taskManager.bulk*() calls ignored in rules client #145316

Closed

pmuellr opened this issue Nov 16, 2022 · 1 comment · Fixed by #147808
Assignees

ymao1

Labels

bug (Fixes for quality problems that affect the customer experience)
Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework)
Feature:Alerting/RulesManagement (Issues related to the Rules Management UX)
Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

pmuellr (Member) commented Nov 16, 2022

Kibana version: main

In the rules client module, there are several calls to this.taskManager.bulk*() methods that do not capture the results of the call. However, one of the operations in the bulk could fail without the call throwing an error. I believe an error is only thrown from a bulk call in catastrophic cases; in the normal case, a response is returned independently for each operation in the bulk, which means we could be treating errors as successes.

Here's an example using bulkUpdateSchedules(); the same pattern appears in the calls to the this.taskManager methods bulkEnable() and bulkDisable():

```ts
try {
  await this.taskManager.bulkUpdateSchedules(taskIds, scheduleOperation.value);
  this.logger.debug(
    `Successfully updated schedules for underlying tasks: ${taskIds.join(', ')}`
  );
} catch (error) {
  this.logger.error(
    `Failure to update schedules for underlying tasks: ${taskIds.join(
      ', '
    )}. TaskManager bulkUpdateSchedules failed with Error: ${error.message}`
  );
}
```
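
For contrast, a minimal sketch of what capturing the result could look like, in the same context as the excerpt above. The `{ errors }` result shape here is an assumed example, not the real bulkUpdateSchedules() return type:

```ts
// Sketch only: inspect per-operation results instead of discarding them.
// The `{ errors }` result shape is an assumption for illustration.
const result = await this.taskManager.bulkUpdateSchedules(
  taskIds,
  scheduleOperation.value
);
if (result.errors?.length) {
  // Treat per-operation failures as failures instead of silent successes.
  this.logger.error(
    `bulkUpdateSchedules reported per-task failures: ${result.errors
      .map((e: { id: string; message: string }) => `${e.id}: ${e.message}`)
      .join('; ')}`
  );
}
```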

To handle this, we are going to want to retry the operations that failed, in case the failure was related to Optimistic Concurrency Control (OCC). We have other OCC retry patterns in the rules client code, so we should copy one of them. I don't think we have "bulk" versions yet, though; those will be a little different in that they should retry just the operations that failed (see the sketch below).
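
A rough sketch of what a bulk OCC retry could look like, resubmitting only the operations that came back with a conflict. The `BulkResult` shape, the 409 status check, and the helper name are all assumptions for illustration, not existing APIs:

```ts
// Hypothetical shape of a bulk result; assumed for illustration only.
interface BulkResult<T> {
  successes: T[];
  errors: Array<{ id: string; status?: number; message: string }>;
}

// Retry a bulk operation, resubmitting only the ids that conflicted.
async function bulkWithConflictRetry<T>(
  ids: string[],
  bulkOp: (ids: string[]) => Promise<BulkResult<T>>,
  maxRetries = 3
): Promise<BulkResult<T>> {
  const successes: T[] = [];
  const errors: BulkResult<T>['errors'] = [];
  let pending = ids;
  for (let attempt = 0; attempt <= maxRetries && pending.length > 0; attempt++) {
    const result = await bulkOp(pending);
    successes.push(...result.successes);
    const conflicts = result.errors.filter((e) => e.status === 409);
    // Non-conflict errors are not retried; surface them immediately.
    errors.push(...result.errors.filter((e) => e.status !== 409));
    if (attempt === maxRetries) {
      // Out of retries: surface the remaining conflicts as errors too.
      errors.push(...conflicts);
      pending = [];
    } else {
      pending = conflicts.map((e) => e.id);
    }
  }
  return { successes, errors };
}
```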

I think there's a chance this is the cause of #141849 (and related failures) and why we haven't been able to find a fix for those in #144739. One explanation would be that we are in fact getting an OCC error for a task that is either just starting or just finishing and updating its own task doc, while the bulk operation is also being called: each one gets the task doc and then updates it. That race is probably pretty rare, but the failures have been fairly rare too.

pmuellr added the bug, Team:ResponseOps, Feature:Alerting/RulesFramework, and Feature:Alerting/RulesManagement labels Nov 16, 2022
elasticmachine (Contributor) commented

Pinging @elastic/response-ops (Team:ResponseOps)

mikecote moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors Nov 17, 2022
doakalexi self-assigned and then unassigned this Dec 7, 2022
ymao1 self-assigned this Dec 19, 2022
ymao1 moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Dec 19, 2022
ymao1 moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Dec 20, 2022
Repository owner moved this from In Review to Done in AppEx: ResponseOps - Execution & Connectors Dec 21, 2022
ymao1 added a commit that referenced this issue Dec 21, 2022
Resolves #145316, #141849, #141864

## Summary

Adds a retry on conflict errors to the saved objects bulk update call
made by task manager. Errors are returned by the saved objects client
inside an array (alongside the success response). Previously, we were
not inspecting the response array, just returning the full data. With
this PR, we inspect the response array specifically for conflict errors
and retry the update for just those tasks.

This `bulkUpdate` function is used both internally by task manager and
externally by the rules client. I default the number of retries to 0 for
bulk updates from within task manager in order to preserve existing
behavior (and to avoid increasing the amount of time it takes for task
manager to run), but use 3 retries when the function is called externally.
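
A sketch of that split; the names, the `retries` option, and the signature are assumptions for illustration, not the merged implementation:

```ts
// Illustrative only: names and the `retries` option are assumed.
type TaskDoc = { id: string; [key: string]: unknown };

class TaskStore {
  async bulkUpdate(docs: TaskDoc[], { retries = 0 }: { retries?: number } = {}) {
    // Issue the saved objects bulk update, then re-issue just the docs
    // whose responses came back with conflict errors, up to `retries`
    // times. (Body omitted in this sketch.)
  }
}

async function examples(taskStore: TaskStore, docs: TaskDoc[]) {
  // Internal task manager callers keep today's single-attempt behavior:
  await taskStore.bulkUpdate(docs);
  // External callers such as the rules client opt into conflict retries:
  await taskStore.bulkUpdate(docs, { retries: 3 });
}
```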

Also unskipped the two flaky disable tests and ran them through the
flaky test runner 400 times with no failures:

* https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1652
* https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1653
* https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1660
* https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1661

Co-authored-by: Kibana Machine <[email protected]>