[ResponseOps] results from taskManager.bulk*() calls ignored in rules client #145316
Labels
bug
Fixes for quality problems that affect the customer experience
Feature:Alerting/RulesFramework
Issues related to the Alerting Rules Framework
Feature:Alerting/RulesManagement
Issues related to the Rules Management UX
Team:ResponseOps
Label for the ResponseOps team (formerly the Cases and Alerting teams)
Kibana version: main
In the rules client module, there are several calls to this.taskManager.bulk*() methods that do not capture the results of the call. However, one of the operations in the bulk request could fail, and the call would still not throw an error. I think a bulk call only throws in catastrophic cases; in the normal case a response is returned independently for each operation in the bulk request. This means we could be treating errors as successes.
Here's an example, but there is similar code for the this.taskManager methods bulkEnable() and bulkDisable():
kibana/x-pack/plugins/alerting/server/rules_client/rules_client.ts
Lines 2312 to 2324 in dd1ad53
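As a rough illustration of what "capturing the results" could look like, here is a minimal sketch that turns per-item failures into a thrown error. The BulkResult and BulkItemError shapes and the assertBulkSucceeded helper are hypothetical, invented for this example; they are not the actual Task Manager types, which should be checked before adopting anything like this.

```typescript
// Hypothetical per-item error shape; NOT the actual Task Manager type.
interface BulkItemError {
  id: string;
  statusCode: number;
  message: string;
}

// Hypothetical bulk response: one entry per operation, split by outcome.
interface BulkResult {
  succeededIds: string[];
  errors: BulkItemError[];
}

// Throws if any item in the bulk response failed, so that a partial
// failure is never silently treated as a success.
function assertBulkSucceeded(operation: string, result: BulkResult): void {
  if (result.errors.length === 0) return;
  const details = result.errors
    .map((e) => `${e.id}: ${e.statusCode} ${e.message}`)
    .join('; ');
  throw new Error(
    `${operation} failed for ${result.errors.length} task(s): ${details}`
  );
}
```

A caller would await the bulk call, pass its result to a check like this, and only then proceed as if the tasks had been updated.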
To handle this, we will want to retry the operations that failed, in case the failure was related to Optimistic Concurrency Control (OCC). We already have OCC retry handling elsewhere in the rules client code, so we should copy that pattern. I don't know that we have "bulk" versions, though; those will be a little different in that they would retry just the operations that failed.
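A bulk-aware retry might look something like the sketch below. It assumes a hypothetical bulk operation that reports per-item errors with an HTTP-style status code, and it assumes that version conflicts surface as 409; both assumptions would need to be verified against the real Task Manager responses. The names BulkOperation and bulkWithConflictRetry are made up for illustration.

```typescript
// Hypothetical shapes; NOT the actual Task Manager types.
interface BulkItemError {
  id: string;
  statusCode: number;
  message: string;
}

interface BulkResult {
  succeededIds: string[];
  errors: BulkItemError[];
}

type BulkOperation = (ids: string[]) => Promise<BulkResult>;

// Assumption: OCC/version conflicts are reported with status 409.
const CONFLICT = 409;

// Runs the bulk operation, then retries only the ids that failed with a
// conflict. Non-conflict errors are collected and never retried.
async function bulkWithConflictRetry(
  operation: BulkOperation,
  ids: string[],
  maxAttempts = 3
): Promise<BulkResult> {
  const succeededIds: string[] = [];
  const permanentErrors: BulkItemError[] = [];
  let lastConflicts: BulkItemError[] = [];
  let pending = ids;

  for (let attempt = 1; attempt <= maxAttempts && pending.length > 0; attempt++) {
    const result = await operation(pending);
    succeededIds.push(...result.succeededIds);
    permanentErrors.push(...result.errors.filter((e) => e.statusCode !== CONFLICT));
    lastConflicts = result.errors.filter((e) => e.statusCode === CONFLICT);
    pending = lastConflicts.map((e) => e.id);
  }

  // Conflicts still outstanding after the final attempt are reported as errors.
  return { succeededIds, errors: [...permanentErrors, ...lastConflicts] };
}
```

Retrying only the conflicting ids avoids re-applying the operation to tasks that already succeeded, and the attempt cap keeps a persistently conflicting task from looping forever.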
I think there's a chance this is the cause of #141849 (and related failures), and why we haven't been able to find a fix for those in #144739. One explanation would be that we are in fact getting an OCC error for a task that is either just starting or just finishing, and is updating its own task doc, at the same moment the bulk operation is called. Each one gets the task doc and then updates it, so one of the two updates hits a version conflict. This is probably pretty rare, but the failures have been fairly rare too.